
10 tips for passing arguments to a Python script

When writing Python utility scripts, you may wish to make your script as flexible as possible so that people can use it in different ways. One approach is to parameterize some of the variables originally hard-coded in the script, and pass them in as arguments. If you have only 1 or 2 arguments, you may find sys.argv good enough, since you can easily access the arguments by their index in the argv list. The limitation is also obvious: it becomes difficult to manage when there are more arguments, with some mandatory and some optional, and you cannot specify the acceptable data type or add a proper description for each argument, etc.

In this article, we will be discussing some tips for the argparse package, which provides an easier way to manage your input arguments.

To get started, you shall import this package into your script and try to run it with some sample code like below:

import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--foo', help='foo help')
args = parser.parse_args()
print(args)

Customize your prefix_chars

Most of the time, you would see people use “-” before the argument name. You can change this default behavior to support more prefix characters, such as + or \. To do that, specify them in prefix_chars when initializing the argument parser, for instance:

parser = argparse.ArgumentParser(prefix_chars='-+/', description="This is to demonstrate multiple prefix characters")
parser.add_argument("+a", "++add")
parser.add_argument("-s", "--sub")
parser.add_argument("/d", "//dir")
args = parser.parse_args()
print(args)

When you save the above as the argumentparser.py file and call it with the below input arguments, you shall see all the arguments parsed correctly as expected:

>>python argumentparser.py +a 1 -s 2 /d python
Namespace(add='1', dir='python', sub='2')

Do take note that if your argument name contains the prefix character “-”, argparse replaces “-” with “_” in the attribute name. For example, the argument name read-only becomes read_only, and you shall use args.read_only to reference its value.
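
Below is a minimal sketch of this behavior (the --read-only flag is just a made-up example):

parser = argparse.ArgumentParser()
# the dash in the argument name becomes an underscore in the attribute name
parser.add_argument("--read-only", action="store_true")
args = parser.parse_args(["--read-only"])
print(args.read_only)
#Output: True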

Argument data type

When you are adding new arguments, the default data type is always string, so whatever value follows the argument name will be stored as a string. The type keyword accepts any callable, commonly the built-in immutable types such as int or float, so you can specify a data type and let the argument parser validate whether the correct data type has been passed in. E.g.:

parser.add_argument("-c", "--count", type=int)

You shall see the below validation error if an incorrect data type has been passed in:

>>python argumentparser.py -c 1.5
usage: argumentparser.py [-h] [-c COUNT]
argumentparser.py: error: argument -c/--count: invalid int value: '1.5'

Various argument actions

The action keyword in add_argument allows you to specify how you want to handle the arguments when they are passed into the script. Some of the commonly used actions are:

  • store – the default behavior
  • store_const – works together with the const keyword to store its value
  • store_true or store_false – set the argument value to True or False
  • append – allows the same argument to appear multiple times and stores the argument values into a list
  • append_const – same as append, but it stores the value from the const keyword
  • count – counts how many times the argument appears

Below are some examples:

parser.add_argument('-a', '--auto', action="store_true", help="to run automatically")
parser.add_argument("-k", "--kelvin",
                        action="store_const",
                        const=273.15,
                        help="The constant to convert Celsius to Kelvin temperature")

parser.add_argument("-t", "--temperature",
                        type=float,
                        action="append",
                        default=[],
                        help="Celsius temperature to be used in %(prog)s")

parser.add_argument('--age', dest='criteria', action='append_const', const=18)
parser.add_argument('--gender',dest='criteria', action='append_const', const="male")
parser.add_argument("-c", "--count", action="count")

When you run in the command line, you shall see all these arguments are parsed correctly and stored into the respective variables:

>>python argumentparser.py -k -t 35.1 -t 37.5 --age --gender -cc -a
Namespace(auto=True, count=2, criteria=[18, 'male'], kelvin=273.15, temperature=[35.1, 37.5])

You can also extend your own class from argparse.Action and pass it to the action keyword for customized handling.
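
For illustration, below is a minimal sketch of a custom action; the VerboseAction name and the print logic are made up for this example:

class VerboseAction(argparse.Action):
    def __call__(self, parser, namespace, values, option_string=None):
        # report which option was seen, then store the value as usual
        print(f"handling {option_string} with value {values}")
        setattr(namespace, self.dest, values)

parser.add_argument("--level", action=VerboseAction)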

Use action="append" or nargs="+"?

If you want to collect a list of values from a particular input argument, you have two options:

  • specify action="append"
  • specify nargs="+"

For the below code, both “amount” and “nums” will be able to store a list of values from the input:

parser.add_argument("-a", "--amount",
                        type=float,
                        action="append")
parser.add_argument("-n", "--nums", nargs="+")

The only difference is that for the “append” action, you need to repeat the argument name whenever you add extra values, while for “nargs” you just put all the space-separated values after the argument name. E.g.:

>>python argumentparser.py -a 1 -a 2 -n 3 4
Namespace(amount=[1.0, 2.0], nums=['3', '4'])

You may notice that if you have any argument with nargs="+", it's better to always supply it after all the positional arguments, as the argument parser would otherwise take your positional argument as part of the preceding argument's values (see the example in the next tip).

Mixing of positional and optional arguments

When no prefix character is used in the argument name, the argument parser will treat it as a positional argument. For instance:

parser.add_argument("caller", help="The process that invoke this script")
parser.add_argument("-c", "--count")

When you check the help for this script, you shall see caller taken as a positional argument.

>>python argumentparser.py -h
usage: argumentparser.py [-h] [-c COUNT] caller

positional arguments:
  caller                The process that invokes this script

optional arguments:
  -h, --help            show this help message and exit
  -c COUNT, --count COUNT

Positional arguments are considered mandatory, so Python will throw an error if they are not specified when calling the script. You can put the positional argument at any place in your input argument stream. E.g.:

>>python argumentparser.py -c 2 "cmd.exe"
>>python argumentparser.py "cmd.exe" -c 2
Namespace(caller='cmd.exe', count=2)

Python is smart enough to interpret and assign the values to the correct variables, unless there is some ambiguity when interpreting your input arguments, e.g. if you use nargs to indicate that multiple argument values can be passed in:

parser.add_argument("-c", "--count", nargs='+')

Putting your positional argument behind this argument will cause an error, because all the values behind “-c” will be taken as the values for “count”:

>>python argumentparser.py -c 1 3 "cmd.exe"
usage: argumentparser.py [-h] [-c COUNT [COUNT ...]] caller
argumentparser.py: error: the following arguments are required: caller

Difference between const and default

The const keyword usually works together with the store_const or append_const action to store the value from the const keyword when the argument appears. If the argument is not supplied, the argument variable will be set to None. Consider the below two arguments:

parser.add_argument("-k", "--kelvin",
                        action="store_const",
                        const=273.15,
                        help="The constant to convert celsius to Kelvin temperature")
parser.add_argument("-c", "--count", default=0)

If you run with the below input arguments, you shall see results similar to the below:

>>python argumentparser.py -k
Namespace(count=0, kelvin=273.15)
>>python argumentparser.py -c 1
Namespace(count='1', kelvin=None)
>>python argumentparser.py -k 270
usage: argumentparser.py [-h] [-k] [-c COUNT]
argumentparser.py: error: unrecognized arguments: 270

So with the const keyword, you basically cannot specify any other value on the command line. But you can still add a default, so that when the argument is not supplied, the variable will be set to the default value rather than None.
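
For instance, a minimal sketch combining const with default (the 0.0 default is made up for illustration):

parser.add_argument("-k", "--kelvin",
                        action="store_const",
                        const=273.15,
                        default=0.0)
# without -k : Namespace(kelvin=0.0)
# with -k    : Namespace(kelvin=273.15)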

Mandatory optional argument?

If you would like your optional argument to be mandatory (although it sounds a bit weird), you can set the required option to True in the add_argument method, e.g.:

parser.add_argument("--data-type", required=True)

With required set to True, even if you have specified the default option, Python will still prompt an error saying the argument --data-type is required.
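
A minimal sketch to demonstrate, assuming a made-up default value:

parser.add_argument("--data-type", required=True, default="csv")
args = parser.parse_args([])
# error: the following arguments are required: --data-type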

Ignore case in choice option

Imagine you are implementing some automation script to be triggered in various modes, and you would like to limit the options accepted for this mode argument. You can specify a list of values in the choices keyword when adding the argument:

parser.add_argument('-m','--mode', choices=['AUTO','SCHEDULER','SEMI-AUTO'])

But you may realize one problem: when you specify “auto” or “Auto”, you would see the below error message:

>>python argumentparser.py -m "auto"
usage: argumentparser.py [-h] [-m {AUTO,SCHEDULER,SEMI-AUTO}]
argumentparser.py: error: argument -m/--mode: invalid choice: 'auto' (choose from 'AUTO', 'SCHEDULER', 'SEMI-AUTO')

By default, the argument parser compares the values in a case-sensitive manner. To ignore the case, you can specify a type keyword and transform the input values into upper or lower case:

parser.add_argument('-m','--mode', choices=['AUTO','SCHEDULER','SEMI-AUTO'], type=str.upper)

Conflicting options

Sometimes defining mutually exclusive arguments can be very useful, as you do not wish two or more options to be used at the same time. The argparse package provides an easy way to group these options with the necessary validation of the input arguments. For instance, you can group the “auto” and “on-demand” modes into a mutually exclusive group, so that only one mode can be activated at a time:

mode_group = parser.add_mutually_exclusive_group()
mode_group.add_argument('-a', '--auto', action="store_true", help="to run automatically")
mode_group.add_argument('-d', '--on-demand', action="store_true", help="to run on demand")

If both arguments are supplied, you would see the below error message:

>>python argumentparser.py -d -a
usage: argumentparser.py [-h] [-a | -d]
argumentparser.py: error: argument -a/--auto: not allowed with argument -d/--on-demand
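
As a side note, if you want to force the user to pick exactly one of the modes, you can make the whole group mandatory by passing required=True when creating it:

mode_group = parser.add_mutually_exclusive_group(required=True)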

Conclusion

The argparse package is super useful when you need to write scripts to be executed from the command line. In this article, we have reviewed some tips that might help you extend your understanding of the different use cases for the individual options argparse provides. If you have a more complicated use case, you may want to read further in the official documentation, such as on sub-commands and file types.


How to calculate date difference between rows in pandas

Problem:

You have some data with date (or numeric) columns, and you already know you can directly use the - operator to calculate the difference between columns; but now you would like to calculate the date difference between two consecutive rows.

For instance, you have some sample GPS tracking data with the start and end time at each location (latitude and longitude), and you would like to calculate the time gap within each location or between two locations.

import pandas as pd
import numpy as np

df = pd.read_excel("GPS Data.xlsx")
df.head(10)

For a quick view, you can see the sample data output as per below:

[Screenshot: the first 10 rows of the GPS tracking data]

Solutions:

  Option 1: Using Series or Data Frame diff

The data frame diff function is the most straightforward way to compare the values between the current row and the previous rows. By default, it compares the current and previous row, and you can also specify the periods argument in order to compare the current row with the row that many positions before it. To calculate the time gap of the start time between two consecutive rows:

df["Start Time"].diff()

You shall see the below output:

[Screenshot: the Start Time diff output as timedelta values]

If you check the data type of the result, you may see it showing as dtype('<m8[ns]'), i.e. timedelta64 in nanoseconds; you can convert the result into integer minutes for easier reading.

In this case, you can use the below timedelta from numpy to convert the date difference into minutes:

df["Start Time"].diff().apply(lambda x: x/np.timedelta64(1, 'm')).fillna(0).astype('int64')

You shall see the below output:

[Screenshot: the diff output converted to integer minutes]
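
As a side note, a roughly equivalent sketch using the pandas timedelta accessor should give the same integer minutes:

df["Start Time"].diff().dt.total_seconds().fillna(0).div(60).astype('int64')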

You can also select multiple date columns as a data frame to apply the diff function.
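
For example, a minimal sketch applying diff to both date columns at once:

df[["Start Time", "End Time"]].diff()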

  Option 2: Using Series or Data Frame shift with the - operator

The shift function allows us to move values up/down or left/right by the given number of periods, depending on which axis you have specified. You can imagine it works like the shift cells function in Excel.

To compare against an adjacent row, you will need to shift the subtrahend column by 1 cell; shift(1) moves the values down by one row, so the below calculates the difference between the current End Time and the Start Time from the previous row:

df["End Time"] - df["Start Time"].shift(1)

You shall see the below result:

[Screenshot: the End Time minus previous Start Time output]

If you want to calculate the difference for multiple date columns, you can use the data frame shift operation as per below. Take note that pandas aligns on column labels when subtracting two data frames, so each column is compared with its own shifted values regardless of the column order you specify:

df[["End Time", "Start Time"]] - df[["Start Time", "End Time"]].shift(1)

[Screenshot: the shifted subtraction output for both columns]

  Option 3: Using data frame sub

The data frame sub function is self-explanatory by its name. You can apply the subtraction at row level or column level by specifying the axis argument. For instance, the below code subtracts the values of the first row (df.loc[0, ...]) from every row, giving the elapsed time of each record relative to the first one, converted into integer minutes:

(df.loc[:,["Start Time", "End Time"]].sub(df.loc[0,["Start Time", "End Time"]], axis='columns')/np.timedelta64(1, "m")).astype("int64")

You can see the below output:

[Screenshot: the sub output in integer minutes]

If you would like to calculate the gap between the current End Time and the Start Time from the previous row, you can combine sub with shift as per below:

df["End Time"].sub(df["Start Time"].shift(1))

It should produce the same result as previously when we used - together with shift.

Conclusion:

Among the 3 options we discussed above, using diff is the most straightforward approach, but you may notice that it can only apply the calculation within the same column; if you would like to calculate the difference between the End Time of one row and the Start Time of an adjacent row, you will have to use sub or the - operator together with shift. One more difference between diff and sub is that sub has the fill_value argument, which substitutes missing values with a default value, so that you do not need another line of code to handle the NA values.


Pandas tricks – split one row of data into multiple rows

As a data scientist or analyst, you will need to spend a lot of time wrangling data from various sources so that you can have a standard data structure for further analysis. There are cases where you get the raw data in some sort of summary view, and you need to split one row of data into multiple rows based on certain conditions in order to do grouping and matching from different perspectives. In this article, we will be discussing a solution to this particular issue.

Prerequisites:

You will need to get pandas installed if you have not yet. Below is the pip command to install pandas:

pip install pandas

And let's import the necessary modules and use this sample data for our demonstration; you can download it into your local folder, or just supply its URL link to the pandas read_excel method:

import pandas as pd
import numpy as np

df = pd.read_excel("eShop-Delivery-Record.xlsx", sheet_name=0)

So if we do a quick view of the first 5 rows of the data with df.head(5), you would see the below output:

[Screenshot: the first 5 rows of the delivery records]

Assume this is the data extracted from an eCommerce system where someone is running an online shop for footwear and apparel products, and the shop provides free 7-day returns for the items it sells. You can see that each row has the order information, when and who performed the delivery service, and, if the customer requested a return, when the item was returned and by which courier service provider. The data is more from the shop owner's view, and you may find it difficult to analyse from the courier service providers' perspective in the current format. So we shall do some transformation to make the format simpler for analysis.

Split one row of data into multiple rows

Now let's say we would like to split one row of data into 2 rows whenever a return happened, so that each row has the order info as well as the courier service info, and we can easily do some analysis such as calculating the return rate for each product, the courier service cost for each month by courier company, etc.

The output format we would like to have is more transaction based, so let's format our date columns and rename the delivery-related columns, so that they won't confuse us later when splitting the data.

df["Delivery Date"] = pd.to_datetime(df["Delivery Date"]).dt.date
df["Return Date"] = pd.to_datetime(df["Return Date"]).dt.date

df.rename(columns={"Delivery Date": "Transaction Date",
                   "Delivery Courier": "Courier",
                   "Delivery Charges": "Charges"}, inplace=True)

And we add one more column, the transaction type, to indicate whether the record is for a delivery or a return. For now, we just assign “DELIVERY” to all records:

df["Transaction Type"] = "DELIVERY"

The rows we need to split are the ones with return info, so let's create a filter by checking whether the Return Date is not empty:

flt_returned = ~df["Return Date"].isna()

If you verify the filter with df[flt_returned], you shall see all rows with return info are selected as per below:

[Screenshot: the rows with return info selected by the filter]

To split out the delivery and return info for these rows, we will need to perform the below steps:

  • Duplicate each such row into 2 rows
  • Change the transaction type to “RETURN” for the second duplicated row
  • Copy values of the Return Date, Return Courier, Return Charges to Transaction Date, Courier, Charges respectively

To duplicate these records, we use the data frame index.repeat() to repeat each of these indexes twice, and then use the loc function to get the data for the repeated indexes. Below is the code to create the duplicate records for the rows with return info:

d = df[flt_returned].loc[df[flt_returned].index.repeat(2),:].reset_index(drop=True)

Next, let's save the duplicated row indexes into a variable, so that we can refer to them multiple times even when some data in the duplicated rows changes. We use the data frame duplicated function to flag the duplicated rows. For this function, the keep="first" argument marks the 1st row as non-duplicate and the subsequent rows as duplicates, while keep="last" marks the 1st row as the duplicate.

idx_duplicate = d.duplicated(keep="first")
#the default value for keep argument is "first", so you can just use d.duplicated()

With this idx_duplicate variable, we can directly update the transaction type for these rows to RETURN:

d.loc[idx_duplicate,"Transaction Type"] = "RETURN"

And next, we shall copy the return info into the Transaction Date, Courier and Charges fields for these return records. You can either rely on the transaction type value to select the rows, or continue to use idx_duplicate to identify the return records.

Below will copy values from Return Date, Return Courier, Return Charges to Transaction Date, Courier, Charges respectively:

d.loc[idx_duplicate, ["Transaction Date", "Courier", "Charges"]] = d.loc[idx_duplicate, 
                                                     ["Return Date", "Return Courier","Return Charges"]].to_numpy()

If you check the data now, you shall see for the return rows, the return info is already copied over:

[Screenshot: the duplicated rows with return info copied over]

(Note: you may want to check here to understand why to_numpy() is needed for swapping columns)

Finally, we need to combine the original rows, which only have delivery info, with the above processed data. Let's also sort the values by order number and reset the index:

new_df = pd.concat([d, df[~flt_returned]]).sort_values("Order#").reset_index(drop=True)

Since the return-related columns are redundant now, we shall drop these columns to avoid confusion, so let's use the below code to drop them by the “Return” keyword in the column labels:

new_df.drop(new_df.filter(regex="Return*", axis=1), axis=1, inplace=True)

(To understand how df.filter works, check this article)

Once we have deleted the redundant columns, you shall see the below final result in new_df:

[Screenshot: the final transaction-based result in new_df]

So we have successfully transformed our data from the shop owner's view to the courier companies' view; each of the delivery and return records is now an individual row.

Conclusion

Data wrangling can sometimes be tough, depending on what kind of source data you get. In this article, we have gone through a solution for splitting one row of data into multiple rows, using the pandas index.repeat to duplicate the rows and the loc function to swap the values. There are other possible ways to handle this; please do share your comments in case you have any better idea.


Why your lambda function does not work

Introduction

The lambda function in Python is designed as a one-liner, throwaway function that does not even need a function name, so it is also known as an anonymous function. Compared to normal Python functions, you do not need to write the def and return keywords for a lambda function, and it can be defined right at the place where you need it, so it makes your code more concise and look a bit special. In this article, we will be discussing some unexpected results you may have encountered when using lambda functions.

Basic usage of lambda

Let's cover some basics of the lambda function before we dive into the problems we are going to solve in this article.

Below is the syntax to define lambda function:

lambda [arguments] : expression

As you can see, a lambda function can be defined with or without arguments, and take note that it only accepts a single expression, not any Python statement. The difference is that an expression can be evaluated into a value (or object), e.g. 2**2, while a statement such as while True: cannot be evaluated into a value. You can think of there being an implicit return keyword before the expression, so your expression must eventually be computed into a value.
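
A minimal sketch of the difference: a conditional expression is fine inside a lambda, while an assignment statement is not:

absolute = lambda x: x if x >= 0 else -x  # conditional expression: OK
print(absolute(-3))
#Output: 3

# not_allowed = lambda x: x = 1  # assignment is a statement: SyntaxError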

And here are some basic usage examples of the lambda function:

square = lambda x: x**2
print(square(4))
#Output: 16
cryptocurrencies = [('Bitcoin', 10948.52),
                    ('Ethereum', 381.41),
                    ('Tether', 1.00),
                    ('XRP', 0.249940),
                    ('Bitcoin Cash', 231.86),
                    ('Polkadot', 4.91),
                    ('Binance Coin', 27.02),
                    ('Chainlink', 10.47),
                    ('Litecoin', 48.20),
                    ('EOS', 2.69),
                    ('TRON', 0.027157),
                    ('Neo', 24.29),
                    ('Stellar', 0.077903),
                    ('Huobi Token', 4.91)]

top5_by_name = sorted(cryptocurrencies, key=lambda token: token[0].lower())[0:5]
print(top5_by_name)
#Output: [('Binance Coin', 27.02), ('Bitcoin', 10948.52), ('Bitcoin Cash', 231.86), ('Chainlink', 10.47), ('EOS', 2.69)]

lowest = min(cryptocurrencies, key=lambda token: token[1])
print(lowest)
#Output: ('TRON', 0.027157)

highest = max(cryptocurrencies, key=lambda token: token[1])
print(highest)
#Output: ('Bitcoin', 10948.52)

highest_in_local_currency = lambda exchange_rate: highest[1] * exchange_rate
highest_sgd = highest_in_local_currency(1.38)
print(highest_sgd)
#Output: 15108.9576

You can see that it is quite convenient when you just need a very short function to be supplied to another function that accepts an argument like key=keyfunc, such as sorted, list.sort, min, max, heapq.nlargest, heapq.nsmallest, itertools.groupby and so on. The common thing about these use cases is that you do not need very complicated logic in the keyfunc (it can be written in one line), and you probably will not reuse it anywhere else. So this is the ideal scenario for a lambda function.

Now let's expand further on our previous example. Assume the bitcoin price fluctuated a lot between Mon & Tue although it still dominated the market, and you would like to convert the price to SGD in the below way:

highest = ('Bitcoin', 10948.52)
mon_highest = lambda exchange_rate: highest[1] * exchange_rate

highest = ('Bitcoin', 10000)
tue_highest = lambda exchange_rate: highest[1] * exchange_rate

print("Mon:", mon_highest(1.36))
print("Tue:", tue_highest(1.36))

You want to assign different values to the highest variable to calculate the price in another currency, but you would be surprised when checking the result:

Mon: 13600.0
Tue: 13600.0

Instead of scratching your head to figure out why it does not work, let's try another approach. I am going to create a list of converter functions, one per cryptocurrency pair, to calculate the price based on the exchange rate supplied. Later I loop through these functions and print out the converted values:

converters = [lambda exchange_rate: crypto[1] * exchange_rate for crypto in cryptocurrencies]
for c in converters:
    print(c(1.36))

I am expecting to see all the prices converted into the local currency based on the exchange rate 1.36, but running the above code gives the below result:

6.6776
6.6776
...(the same value, 4.91 * 1.36 for the last entry 'Huobi Token', printed for all 14 converters)

Same as the previous behaviour: only the last value was used in the lambda functions. So why does it not work as intended when lambda is used this way?

Runtime data binding

When people come across this issue, it is usually due to a fundamental misunderstanding of variable binding for Python functions. For a Python function, regardless of whether it is a normal or a lambda function, the free variables used in the function are bound at runtime, not at definition time. So for our first example, the lambda function looks up the highest variable in the enclosing scope at the moment it is executed.
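
This behavior is not specific to lambda; a minimal sketch with a normal function shows the same late binding:

rate = 1.36
def convert(amount):
    # rate is looked up when convert is called, not when it is defined
    return amount * rate

rate = 1.40
print(convert(100))
#Output: 140.0, not 136.0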

With this concept cleared, you shall be able to understand the behavior of the output from above two examples, only the latest values at execution time were used in the lambda function.

To fix this issue, we just need a minor change to our original code: pass the variable into the function definition as the default value of an extra argument. For instance, below is the fix for the first example:

mon_highest = lambda exchange_rate, highest = highest: highest[1] * exchange_rate
tue_highest = lambda exchange_rate, highest = highest: highest[1] * exchange_rate

Below is the fix for the second example:

converters = [lambda exchange_rate, crypto = crypto: crypto[1] * exchange_rate for crypto in cryptocurrencies]

You may wonder why lambda must be used in the above two examples; indeed, they do not necessarily require a lambda function at all. For the first example, since you need to call the function more than once, you should just use a normal function instead, and be more careful whenever you reference any variable from outside the function.

And for the second example, it can simply be replaced with a list comprehension as per below:

[crypto[1] * 1.36 for crypto in cryptocurrencies]

Conclusion:

The lambda function provides convenience for writing tiny functions for one-time use, and makes your code concise. But it is also highly restricted by its single expression, as you cannot use multiple statements, exception handling, conditions, etc. Whatever a lambda does, you can definitely use a normal function to replace it. The only thing that matters is readability, so you will need to evaluate whether it is the best scenario for a lambda, and bear in mind how variable binding works.

 


Python cache – the must-read tips for code performance

Introduction

Most of us may have experienced scenarios where we need to implement some computationally expensive logic such as recursive functions, or need to read from I/O or the network multiple times. Such functions typically require more resources and longer CPU time, and can eventually cause performance issues if handled without care. For such cases, you shall always pay special attention to them once you have completed all the functional requirements, as the additional cost in resources and time may eventually lead to user experience issues. In this article, I will be sharing how we can make use of a cache mechanism (aka memoization) to improve code performance.

Prerequisites:

To follow the examples below, you will need to have the requests package installed in your working environment; you may use the below pip command to install it:

pip install requests

With this ready, let’s dive into the problem we are going to solve today.

As mentioned before, computationally expensive logic such as recursive functions or reading from I/O or the network usually has a significant impact on the runtime, and is always a target for optimization. Let me illustrate with a specific example: assume we need to call some external API to get the rates:

import requests

def inquire_rate_online(dimension):
    result = requests.get(f"https://postman-echo.com/get?dim={dimension}")
    if result.status_code == requests.codes.OK:
        data = result.json()
        return data["args"]["dim"]
    return ''

This function makes a call through the network and returns the result (for demo purposes, this API call just echoes back the input). If you provide this as a service to everybody, there is a high chance that different people will inquire the rate with the same dimension value. In that case, you may wish to store the result somewhere after the first inquiry, so that you can return the stored result for subsequent inquiries rather than making the API call again. This sort of caching mechanism should speed up your code.

Implement cache with global dictionary

For the above example, the most straightforward way to implement a cache is to store the arguments and results in a dictionary, and every time check this dictionary to see whether the key exists before calling the external API. We can implement this logic in a separate function as per below:

cached_rate = {}
def cached_inquire(dim):
    if dim in cached_rate:
        print(f"cached value: {cached_rate[dim]}")
        return cached_rate[dim]
    cached_rate[dim]= inquire_rate_online(dim)
    print(f"result from online : {cached_rate[dim]}")
    return cached_rate[dim]

With this code, you cache the previous keys and results in the dictionary, so that subsequent calls are answered directly from a dictionary lookup rather than an external API call. This should dramatically improve your code performance, since reading from a dictionary is much faster than making an API call through the network.

You can quickly test it from Jupyter Notebook with the %time magic:

%time cached_inquire(1)

The first time you call it, you would see the time taken is over 1 second (depending on the network conditions):

result from online : 1
Wall time: 1.22 s

When calling it again with the same argument, we should expect our cached result to start working:

%time cached_inquire(1)

You can see the total time taken dropped to 997 microseconds for this call, which is over 1200 times faster than previously:

cached value: 1
Wall time: 997 µs

With this additional global dictionary, we can see a big improvement in performance. But some people may have concerns about the additional memory used to hold these values in the dictionary, especially if the result is a huge object such as an image file or an array. Python has a separate module called weakref which addresses this problem.

Implement cache with weakref

Python's weakref module allows creating weak references to objects, so that garbage collection is free to destroy those objects whenever needed in order to reuse the memory.

For demonstration purpose, let’s modify our earlier code to return a Rate class instance as the inquiry result:

class Rate():
    def __init__(self, dim, amount):
        self.dim = dim
        self.amount = amount
    def __str__(self):
        return f"{self.dim} , {self.amount}"

def inquire_rate_online(dimension):
    result = requests.get(f"https://postman-echo.com/get?dim={dimension}")
    if result.status_code == requests.codes.OK:
        data = result.json()
        return Rate(float(data["args"]["dim"]), float(data["args"]["dim"]))
    return Rate(0.0,0.0)

And instead of a normal Python dictionary, we will use a WeakValueDictionary to hold weak references to the returned objects; below is the updated code:

import weakref

wkrf_cached_rate = weakref.WeakValueDictionary()
def wkrf_cached_inquire(dim):
    if dim in wkrf_cached_rate:
        print(f"cached value: {wkrf_cached_rate[dim]}")
        return wkrf_cached_rate[dim]

    result = inquire_rate_online(dim)
    print(f"result from online : {result}")
    wkrf_cached_rate[dim] = result
    return wkrf_cached_rate[dim]

With the above changes, if you run wkrf_cached_inquire twice, you shall see a significant improvement in performance:

[Screenshot: the second wkrf_cached_inquire call returns the cached value in microseconds]

And the dictionary does not hold the Rate instances themselves, but rather weak references to them, so you do not have to worry about the extra memory used: garbage collection will reclaim an object when needed, and your dictionary will automatically have the corresponding entry removed. The program will then call the external API again for that key, just like the first time.
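
A minimal sketch to observe this eviction behavior (gc.collect() is called just to force a collection):

import gc

rate = Rate(1.0, 1.0)
wkrf_cached_rate["demo"] = rate
print("demo" in wkrf_cached_rate)
#Output: True
del rate  # drop the only strong reference to the Rate instance
gc.collect()
print("demo" in wkrf_cached_rate)
#Output: False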

If you stop reading here, you will miss the most important part of this article, because what we have gone through above is good, but just not perfect, due to the below issues:

  • In the example, we only have 1 argument for the inquire_rate_online function; things get tedious when you have more arguments, as all of them have to be combined into the key for the dictionary. In that case, re-implementing the caching as a decorator function would probably be easier (see the sketch after this list)
  • Sometimes you do not really want to let garbage collection determine which values are cached longer than others; rather, you want your cache to follow certain logic, for instance, keeping entries based on the time from the most recent calls to the least recent calls, aka least recently used (LRU)
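
On the first point, below is a minimal sketch of how such a caching decorator could look; the memoize name is made up, and this simplified version only handles positional arguments:

from functools import wraps

def memoize(func):
    cache = {}
    @wraps(func)
    def wrapper(*args):
        # use the positional arguments as the cache key
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper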

If the least recently used cache mechanism makes sense for your use case, you shall consider making use of the lru_cache decorator from the functools module, which will save you a lot of effort reinventing the wheel.

Cache with lru_cache

The lru_cache decorator accepts two arguments:

  • maxsize – limits the size of the cache; when it is None, the cache can grow without bound
  • typed – when set to True, arguments of different types will be cached separately, e.g. inquire_rate_online(1) and inquire_rate_online(1.0) will be cached as different entries

With the understanding of the lru_cache, let’s decorate our inquire_rate_online function to have the cache capability:

from functools import lru_cache

@lru_cache(maxsize=None)
def inquire_rate_online(dimension):
    result = requests.get(f"https://postman-echo.com/get?dim={dimension}")
    if result.status_code == requests.codes.OK:
        data = result.json()
        return Rate(float(data["args"]["dim"]), float(data["args"]["dim"]))
    return Rate(0.0,0.0)

If we re-run our inquire_rate_online twice, you can see the same effect as previously in terms of the performance improvement:

[Screenshot: the second inquire_rate_online call returns the cached result in microseconds]

And with this decorator, you can also see how the cache has been used. The hits field shows the no. of calls that have been answered from the cached results:

inquire_rate_online.cache_info()
#CacheInfo(hits=1, misses=1, maxsize=None, currsize=1)

Or you can manually clear all the cache to reset the hits and misses to 0:

inquire_rate_online.cache_clear()

Limitations:

Let’s also talk about the limitations of the solutions we discussed above:

  • The cache mechanism works best for a deterministic function, meaning that given the same set of inputs, it always returns the same set of results. You would not benefit much if you try to cache the result of a nondeterministic function, e.g.:
import random
def random_x(x):
    return x*random.randint(1,1000)
  • For keyword arguments, if you swap the positions of the keywords, the two calls will be cached as separate entries
  • It only works for arguments that are hashable, e.g. the immutable data types; see the snippet below
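
A minimal sketch of the last limitation, assuming inquire_rate_online is decorated with lru_cache as above:

inquire_rate_online([1, 2])
#TypeError: unhashable type: 'list'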

Conclusion

In this article, we have discussed different ways of creating a cache to improve code performance whenever you have computationally expensive operations or heavy I/O or network reads. Although the lru_cache decorator provides an easy and clean solution for creating a cache, it is still better to understand the underlying implementation of a cache before you just take it and use it.

We also discussed the limitations of these solutions that you may need to take note of before implementing them. Nevertheless, they should still help in a lot of scenarios where you can make use of these methods to improve your code performance.