
Pandas Tips – Convert Columns To Rows

  Introduction

In one of my previous posts – Pandas tricks to split one row of data into multiple rows – we discussed a solution for splitting summary data from one row into multiple rows in order to standardize the data for further analysis. Similarly, there are many scenarios where we have aggregated data, like an Excel pivot table, and we need to unpivot it from wide to long format for better analysis. In this article, I will share a few tips for converting columns to rows with a pandas DataFrame.

Prerequisites

To run the code examples that follow, you should have pandas installed in your working environment. Below is the pip command to install pandas:

pip install pandas

We will be using the data from this file for the demonstration, so you may download it and examine how the data looks with the code below:

import pandas as pd
import os
data_dir = "c:\\your_download_dir"
df = pd.read_excel(os.path.join(data_dir, "Sample-Data.xlsx"))

You shall see the sample sales data as per below:

[Screenshot: the sample sales data in wide format]

The sales amount has been summarized for each product in the last 4 columns. With this wide data format, it would be difficult for us to do certain analysis, for instance, finding the top salesman by month and product, or the best-selling products by month, etc.

A better data format would be to transform the product columns into rows so that each row represents only one product and its sales amount. Now let's explore the different ways to convert columns to rows with pandas.

Using Pandas Stack Method

The most immediate solution you may think of would be the stack method, as it allows you to stack the columns vertically onto each other and turn them into multiple rows. For our case, we need to set the DataFrame index to "Salesman" and "Order Date", so that the product columns are stacked based on this index. For instance:

df.set_index(["Salesman", "Order Date"]).stack()

If you check the result now, you shall see the below output:

[Screenshot: the stacked result as a MultiIndex Series]

This is a MultiIndex Series with the index names [‘Salesman’, ‘Order Date’, None], so you can reset the index and name the Series "Amount", and at the same time rename the "None" index level to "Product Desc" to make it more meaningful. E.g.:

df.set_index(["Salesman", "Order Date"])\
    .stack()\
    .reset_index(name='Amount')\
    .rename(columns={'level_2':'Product Desc'})

With the above code, you can see the output similar to below:

[Screenshot: the reshaped data in long format with the Amount column]

If you do not want to keep the records with 0 sales amount, you can easily apply a filter to the DataFrame to get cleaner data.
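For instance, a minimal sketch of such a filter on the reshaped result (reusing the stack code from above and keeping only the non-zero amounts):

df_long = df.set_index(["Salesman", "Order Date"])\
    .stack()\
    .reset_index(name='Amount')\
    .rename(columns={'level_2': 'Product Desc'})

# drop the rows with 0 sales amount
df_long = df_long[df_long["Amount"] != 0]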

Using Pandas Melt method

The melt method is a very powerful function for unpivoting data from wide to long format. It is like the opposite of the pivot_table function, so if you are familiar with pivot_table or the Excel pivot table, you should be able to understand its parameters easily.

To achieve the same result as per the stack function, we can use the below code with melt method:

df.melt(id_vars=['Salesman', 'Order Date'], 
        value_vars=['Beer', 'Red Wine', 'Whisky', 'White Wine'],
        var_name="Product Desc",
        value_name='Amount')

The id_vars parameter specifies the identifier columns that stay as rows. The value_vars and var_name parameters specify the columns to unpivot and the name of the new column holding the former column names, and value_name indicates the name of the value column. To better understand these parameters, imagine how the data would be generated via a pivot table in Excel; melt is the reverse of that process.

[Screenshot: the melt output in long format]

Using Pandas wide_to_long Method

The wide_to_long method is quite self-explanatory by its name. It uses pandas.melt under the hood and is designed to solve some particular problems. For instance, if your column names follow certain patterns, such as including a year, a number or a date, you can specify the pattern and extract that info while converting those columns to rows.

Below is the code that generates the same output as our previous examples:

pd.wide_to_long(
    df, 
    stubnames="Amount", 
    i=["Salesman", "Order Date"], 
    j="Product Desc", 
    suffix=r"|Red Wine|White Wine|Whisky|Beer").reset_index()

The stubnames parameter gives the name of the value column in the output (normally it is the common prefix of the wide columns being unpivoted). The i parameter specifies the columns for grouping the rows, and j is the name of the new column that holds the stacked column names. Since our product column names do not follow any pattern, we simply list out all the product names in the suffix parameter.

As wide_to_long returns a DataFrame with a MultiIndex, we need to reset the index to get a flat data structure.

You may not see the power of this function from the above example, but if you look at the example below from its official documentation, you will understand how useful this function is for solving this type of problem.

[Screenshot: the wide_to_long example from the official pandas documentation]

Performance Consideration

When testing the performance of the above 3 methods, wide_to_long takes significantly longer than the other two, and melt seems to be the fastest. The results may vary for larger data sets, so you will need to evaluate again based on your own data.
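For reference, below is a rough sketch of how such timings can be collected with IPython's %timeit magic; this is an assumption about how the numbers below were produced, and only the core reshaping call of each approach is timed:

%timeit df.set_index(["Salesman", "Order Date"]).stack()
%timeit df.melt(id_vars=['Salesman', 'Order Date'], var_name="Product Desc", value_name='Amount')
%timeit pd.wide_to_long(df, stubnames="Amount", i=["Salesman", "Order Date"], j="Product Desc", suffix=r"|Red Wine|White Wine|Whisky|Beer")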

#timeit for stack method
4.52 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#timeit for melt method
3.5 ms ± 238 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#timeit for wide_to_long method
17.8 ms ± 709 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Conclusion

In this article, we have reviewed 3 pandas methods for converting columns to rows when you need to unpivot your data or transform it from wide to long format for further analysis. A simple test shows that the melt method performs the best and wide_to_long takes the longest time, but bear in mind that wide_to_long has its specific use cases which the other functions may not be able to handle.


20 Tips for Using Python Pip

Introduction

Python has become one of the most popular programming languages thanks to its easy-to-use syntax as well as the thousands of open-source libraries developed by the Python community. For almost every problem you want to solve, you can find a solution among these third-party libraries, so you do not need to reinvent the wheel. The majority of these libraries are hosted in the repository called PyPI, and you can install them with the Python pip command.

The Python pip module helps you manage the download and installation of packages and resolve dependency requirements. Although you have probably used pip for some time, you may not have spent much time reading through its user guide for some of the useful operations. In this article, we summarize 20 useful tips for managing Python third-party packages with pip.

Check the current pip version

Since Python version 3.4, the pip module has been included by default with the Python binary installer, so you do not need to install it separately once you have Python installed. To check the version of the pip package, you can use the below:

pip --version

Sample output:

[Screenshot: pip version output]

Install package from Pypi

Installing a package is very simple with the pip command: use the "install" option followed by one or multiple package names:

pip install requests

By default, pip looks for the latest release and installs the latest version for you, together with the dependency packages. Sample output as per below:

[Screenshot: pip install output for the requests package]

You can also specify the version number of the package to be installed:

py -m pip install pip==21.1.1

Sample output:

[Screenshot: pip install output with a pinned version]

Pip also supports a list of version specifiers such as >=1.2, <2.0, ~=2.0, !=2.0 or ==1.9.* for matching the correct version of the package to be installed.
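For example, to accept any 2.x release of requests at or above 2.20 (quoting the specifier so the shell does not interpret the special characters), you could run something like:

pip install "requests>=2.20,<3.0"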

When you are not in a virtual environment, the package will be installed into the global folder (system site) by default. You can use the "--user" option to install it into the user site instead, in case of any permission issue. E.g.:

pip install --user requests

Output as per below:

[Screenshot: pip install output with the --user option]

Although you can specify your own customized installation path for your different projects, using a virtual environment is still the best way to manage dependencies and conflicts.
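As a quick reference, a typical virtual environment workflow with the built-in venv module looks like the below; the folder name .venv is just a common convention:

python -m venv .venv            # create the virtual environment
.venv\Scripts\activate          # activate it on Windows
source .venv/bin/activate       # or activate it on Linux/macOS
pip install requests            # packages now install into .venv only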

Show package version and installation location

To check the basic information such as version number or installation location for an existing package, you can use the “show” option:

pip show colorama

You can see the below information about the package:

[Screenshot: pip show output for the colorama package]

You can also use the "--verbose" mode to display additional metadata.

List all the packages installed

To list out all the packages installed, you can use the “list” option:

py -m pip list

You shall see the output format similar to below:

[Screenshot: pip list output]

You can add the "--user" option to list all packages installed in your user site, e.g.:

py -m pip list --user

When you are using a virtual environment created with "--system-site-packages" (allowing the virtual environment to access system-site packages), you can use the "list --local" option to show only the packages installed in your virtual environment:

py -m pip list --local

List all the outdated packages

To check if any installed packages are outdated, you can use the "--outdated" option:

py -m pip list -o
# or
py -m pip list --outdated

Below is the sample output:

[Screenshot: pip list output showing outdated packages]

Upgrade package to the latest version

Once you have identified the outdated packages, you can use the "--upgrade" (or "-U") option to upgrade them to the latest version. Multiple package names can be specified, separated by whitespace:

py -m pip install --upgrade pip
#or 
py -m pip install -U pip setuptools

Sample output as per below:

[Screenshot: pip upgrade output]

Auto upgrade packages to the latest version

Pip does not have an option to automatically upgrade all outdated packages, but you can make use of the result from "list -o" and create a simple script to achieve it, e.g.:

#in Windows command line
for /F "skip=2 delims= " %i in ('pip list -o --local') do pip install -U %i

#in linux
pip list -o --local | tail -n +3 | awk '{print $1}' | xargs -n1 pip install -U

Export installed packages

You can use the "freeze" option to export all your installed package names into a text file, so that you can re-create exactly the same project environment on another PC. For instance:

py -m pip freeze -l > requirements_demo.txt

Result in the output text file:

[Screenshot: the exported requirements file]

Install multiple packages from requirement file

For the packages you've exported with the "freeze" option, you can re-install them all in another environment with the "-r" option:

py -m pip install -r requirements.txt

You may see the below output when you have package name “numpy” in your requirements.txt file:

[Screenshot: pip install output using a requirements file]

A requirements.txt file can also include other requirement files. This may be useful when you have a sub-module that requires extra packages and can run independently as a separate application. You may put the common packages in requirements.txt and the additional packages in a requirements_module1.txt file, then include requirements.txt in the module file.

E.g. the content in the requirements_module1.txt:

#opencv-python
#comment out some packages
python-dateutil

-r requirements.txt

When you run the “install” command:

py -m pip install -r requirements_module1.txt

You shall see the sample output as per below:

[Screenshot: pip install output with multiple requirement files]

Uninstall packages

Uninstalling an existing package can be done with below command:

pip uninstall numpy

Output as per below:

[Screenshot: pip uninstall output]

Install package from wheel file

When you have a binary wheel file downloaded in your local folder, you can also use the “install” option to install the wheel file directly:

py -m pip install --force-reinstall C:\Users\codef\Downloads\python_dateutil-2.8.2-py2.py3-none-any.whl

Output as per below:

[Screenshot: pip install output for a local wheel file]

Install package from non-Pypi index

If the package is not hosted in the PyPI index, you can manually specify the index URL with "--index-url", or simply "-i":

py -m pip install -i https://mirrors.huaweicloud.com/repository/pypi/simple/ colorama

The above command downloads and installs the package from the Huawei Cloud repository (a PyPI mirror):

[Screenshot: pip install output from a PyPI mirror]

This is also helpful when you are not able to access PyPI directly due to a firewall or proxy issue in your network; you can find a PyPI mirror repository and download the packages from there. Usually these mirrors synchronize with PyPI at a few minutes' interval, which should not cause any issue for your development work.

Configure global index url

To permanently save the index URL so that you do not have to key in the URL for every package installation, you can use the "config" option to set the URL globally, e.g.:

pip config set global.index-url https://mirrors.aliyun.com/pypi/simple

With the above setting, you can install packages from the mirror repository as per normal without specifying the URL option.
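You can also verify or remove this setting later with the same config option, for instance:

# show the current pip configuration
pip config list
# remove the global index url setting
pip config unset global.index-url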

Check package compatibility

When you manually install packages, sometimes you may encounter issues where some dependency packages have an incompatible version installed. To check if you have any such issue, you can use the "check" option:

python -m pip check

You may see something similar to below when there is any conflict:

[Screenshot: pip check output showing a version conflict]

Download package into local folder

You can download the package wheel files into your local folder when you need:

pip download requests -d .\requests

The "-d" option allows you to specify the target folder where you want to save the wheel files. You may get multiple wheel files if the package has any dependency packages. (You can use "--no-deps" when you do not want to download the dependency files.)

Below is the sample result:

[Screenshot: pip download output showing the downloaded wheel files]

Install package from local folder

To install the package from a local folder, you can use the "-f" (--find-links) option with the folder path:

pip install requests -f .\requests

This is the same as installing the package from Pypi:

[Screenshot: pip install output using the local folder]

Conclusion

In this article we have summarized some useful tips for using Python pip to manage the installation and upgrading of third-party packages for your Python projects. For more advanced usage of this module, you may refer to its official documentation.


Python datetime – the 9 tips you shall know

Introduction

Dealing with dates and times is quite common when writing Python scripts; the simplest use cases would be logging events with a timestamp, or saving a file with date and time info in the file name. It can be challenging when you have more complicated scenarios such as handling time zones, daylight saving time and recurrences. The built-in Python datetime module is capable of handling most date and time operations, and there are third-party libraries that can help you easily manage the time zone and daylight saving challenges. In this article, we will discuss some tips for using the Python datetime module as well as the third-party package dateutil.

Prerequisite

If you do not have dateutil installed yet, you shall install the latest version into your working environment. Below is the pip command to install the package:

pip install python-dateutil

Let’s get started!

Various ways to get current date and time

The number one use case for a Python datetime object is getting the current date or time. There are plenty of ways to get the current date and time from the Python datetime module, for instance:

>>>from datetime import datetime, timezone
>>>import time

#Local timezone
>>>datetime.now()
datetime.datetime(2020, 10, 24, 21, 31, 11, 761666)
>>>datetime.today()
datetime.datetime(2020, 10, 24, 21, 31, 12, 139719)

>>>datetime.fromtimestamp(time.time())
datetime.datetime(2020, 10, 24, 21, 31, 12, 559183)

#Not suggested
>>>datetime.fromtimestamp(time.mktime(time.localtime()))
datetime.datetime(2020, 10, 24, 21, 33, 5)

#UTC timezone
>>>datetime.now(timezone.utc)
datetime.datetime(2020, 10, 24, 13, 31, 13, 443442, tzinfo=datetime.timezone.utc)
>>>datetime.utcnow()
datetime.datetime(2020, 10, 24, 13, 31, 14, 240517)

Most of the above methods return a datetime object in the local machine's time, while the last two get the date and time in the UTC time zone.

If you only need the date info, you can discard the time portion by using the date() method as per below:

>>>datetime.now().date()
datetime.date(2020, 10, 24)

Get year, month, day and time from Python datetime

From the datetime object, you can easily get the individual components such as year, month, day, hour, etc. The examples below show how to extract the date and time components from the datetime object, as well as the weekday and week number information:

>>>TODAY = datetime.today()
>>>TODAY.year, TODAY.month, TODAY.day, TODAY.hour, TODAY.minute, TODAY.second, TODAY.microsecond
(2020, 10, 24, 21, 36, 35, 842689)

#Monday is 0 and Sunday is 6
>>>TODAY.weekday()
5
#Monday is 1 and Sunday is 7
>>>TODAY.isoweekday()
6
#Return year, weekno, and weekday
>>>TODAY.isocalendar()
(2020, 43, 6)

Take note of the starting day when you get the weekday as a number: weekday() returns 0 for Monday, while isoweekday() returns 1 for Monday. Some programming languages use 0 for Sunday; in that case you can use the %w format code to get the weekday number starting from 0 as Sunday.
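For example, a quick sketch with the %w format code, using the same TODAY (a Saturday) from above:

>>>TODAY.strftime("%w")
'6'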

Date plus or minus X days

Very often you will need to do some arithmetic on dates, such as calculating a number of days backward or forward from the current date. To do that, you will need the timedelta class. Below is the syntax to create a timedelta object; you can specify the number of weeks, days, hours, minutes, etc. for initialization:

>>>from datetime import timedelta
>>>timedelta(days=1, seconds=50, microseconds=1000, milliseconds=1000, minutes=10, hours=6, weeks=1)
datetime.timedelta(days=8, seconds=22251, microseconds=1000)

All the arguments passed to timedelta will be eventually converted into days, seconds and microseconds.

So to calculate today plus 1 day, you can specify the timedelta with 1 day and add it up to the current date:

>>>tomorrow = datetime.today().date() + timedelta(days=1)
datetime.date(2020, 10, 25)

Similarly, calculating the date backwards can be achieved by specifying the arguments as negative numbers:

>>>yesterday = datetime.today().date() + timedelta(days=-1) 
datetime.date(2020, 10, 23)

When calculating the difference between two dates, it will also return a timedelta object:

>>>tomorrow - yesterday
datetime.timedelta(days=2)

Get the first day of the month

With the replace() method, you can replace the year, month or day of the current date and return a new date. The most commonly used scenario would be getting the first day of the month based on current date, e.g.:

>>>datetime.today().date().replace(day=1)
datetime.date(2020, 10, 1)

Format date with strftime and strptime

There are many scenarios where you need to parse a date from a string or format a date object into a string. You can easily convert a date into string format with the strftime method, for instance:

>>>datetime.now().strftime("%Y-%b-%d %H:%M:%S")
'2020-Oct-25 20:35:54'

And similarly, you can use strptime to convert a string into a datetime object:

>>>datetime.strptime("Oct 25 2020 08:10:00", "%b %d %Y %H:%M:%S")
datetime.datetime(2020, 10, 25, 8, 10)

You can check here for the full list of format codes supported by strftime and strptime. Do take note that strptime can be much slower than you expect if you use it in a loop over a large set of data. For such cases, you may consider directly using datetime.datetime(year, month, day) to construct the datetime object.
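As a rough sketch, if your strings all follow a simple fixed layout such as "2020-10-25", you can split them manually and call the constructor directly, which skips the per-call format parsing that strptime performs:

>>>year, month, day = map(int, "2020-10-25".split("-"))
>>>datetime(year, month, day)
datetime.datetime(2020, 10, 25, 0, 0)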

Create time zone aware date

Most of the methods in the Python datetime module return time-zone-naive objects (which means they do not include any time zone info). In case you need time-zone-aware objects, you can specify the time zone info when initializing a date/time object, for instance:

>>>singapore_tz = timezone(timedelta(hours=8), name="SGT")
>>>sg_time_now = datetime.now(tz=singapore_tz)
datetime.datetime(2020, 10, 24, 22, 31, 6, 554991, tzinfo=datetime.timezone(datetime.timedelta(seconds=28800), 'SGT'))

If you use 3rd-party libraries like pytz or dateutil, you can easily get the time zone info by supplying an IANA time zone database name or a Windows time zone name. Below is an example with dateutil:

>>>import dateutil

#time zone database name from IANA
>>>sh_tz = dateutil.tz.gettz('Asia/Shanghai')
>>>datetime(2020, 10, 24, 22, tzinfo = sh_tz)
datetime.datetime(2020, 10, 24, 22, 0, tzinfo=tzfile('PRC'))

#windows time zone names
>>>cn_tz = dateutil.tz.gettz('China Standard Time')
>>>datetime(2020, 10, 24, 22, tzinfo = cn_tz)
datetime.datetime(2020, 10, 24, 22, 0, tzinfo=tzwin('China Standard Time'))

With the time zone database, you do not need to worry about the offset hours; you only need to provide the name to get the correct date and time in the respective time zone.
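For instance, a small sketch converting a time-zone-aware datetime from Shanghai time to UTC with astimezone, reusing the dateutil time zone from above:

>>>sh_time = datetime(2020, 10, 24, 22, tzinfo=dateutil.tz.gettz('Asia/Shanghai'))
>>>sh_time.astimezone(dateutil.tz.tzutc())
datetime.datetime(2020, 10, 24, 14, 0, tzinfo=tzutc())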

Get a date by relative period

If you use timedelta to get a date that is a relative period such as 1 year or 1 month from the current date, you may sometimes run into problems with leap years. For instance, the below returns Apr 30 because 2020 is a leap year and the one-year period should span 366 days rather than 365:

>>>datetime(2019, 5, 1) + timedelta(days=365)
datetime.datetime(2020, 4, 30, 0, 0)

The simple way to get the correct result is to use relativedelta from the dateutil package, e.g.:

>>>from dateutil.relativedelta import relativedelta

>>>datetime(2019, 5, 1) + relativedelta(years=1)
datetime.datetime(2020, 5, 1, 0, 0)

You can also specify the other arguments such as the months, days and hours:

>>>datetime.today() + relativedelta(years=1, months=1, days=10, hours=10)
datetime.datetime(2021, 12, 5, 8, 49, 31, 386813)

To get the date of the next Saturday from the current date, you can use:

>>>import calendar
>>>datetime.today() + relativedelta(weekday=calendar.SATURDAY)
datetime.datetime(2020, 10, 24, 15, 16, 10, 502191)

Take note that if you run it on a Saturday before 23:59:59, it will just return the current date, so it actually returns the nearest upcoming Saturday from your current date.

List out all the weekdays

In case you need to get all the weekdays starting from a particular date, you can make use of the recurrence rules from dateutil package.

For instance, the below rrule specifies to recur on daily basis for Mon to Fri with a start and end date:

>>>from dateutil.rrule import rrule, DAILY, MO, TU, WE, TH, FR
>>>from dateutil.parser import parse

>>>list(rrule(DAILY, interval=1, byweekday=[MO, TU, WE, TH, FR], dtstart=datetime.now().date(), until=datetime(2020, 11, 2)))

[datetime.datetime(2020, 10, 26, 0, 0),
 datetime.datetime(2020, 10, 27, 0, 0),
 datetime.datetime(2020, 10, 28, 0, 0),
 datetime.datetime(2020, 10, 29, 0, 0),
 datetime.datetime(2020, 10, 30, 0, 0),
 datetime.datetime(2020, 11, 2, 0, 0)]

The frequency and interval arguments determine how often the recurrence happens, and the byweekday and dtstart arguments further constrain which dates are selected.

Besides the weekday argument, you can also specify by year, month, hour, minute etc. You can check here for all the available arguments supported for instantiating the rrule object.

As another example, the below code returns a list of dates recurring at 9:15 am every other day:

>>>list(rrule(DAILY, interval=2, byminute=15, count=4, dtstart=parse("20201024T090000")))
[datetime.datetime(2020, 10, 24, 9, 15),
 datetime.datetime(2020, 10, 26, 9, 15),
 datetime.datetime(2020, 10, 28, 9, 15),
 datetime.datetime(2020, 10, 30, 9, 15)]

Get a list of business days

Sometimes you would need to exclude public holidays to get only the business days. To do so, you may first get a list of holidays from another 3rd-party library like holidays, or simply put all the holidays into a config file, and then exclude these dates from the recurrence with an rruleset. For instance:

from dateutil.rrule import rruleset

holidays = [
    datetime(2020, 7, 10,),
    datetime(2020, 7, 31,),
    datetime(2020, 8, 10,)
]
r = rrule(DAILY, interval=1, byweekday=[MO, TU, WE, TH, FR],
   dtstart=datetime(2020, 7, 10), until=datetime(2020, 8, 1))

# collect the rule into a rule set, then exclude each holiday date
rs = rruleset()
rs.rrule(r)

for d in holidays:
    rs.exdate(d)

print(list(rs))

You can see the public holidays have been excluded from the return results:

[Screenshot: the list of business days with the holidays excluded]

Conclusion

Working with dates can sometimes be tough, especially when you need to manipulate dates in different time zones or consider daylight saving time. Luckily, with Python datetime and other 3rd-party libraries like dateutil, things are getting easier. But you will still need to be very careful when handling dates with time zones or setting up recurrence rules in local time.

Thanks for reading, and you can find other Python related topics from here.


10 tips for passing arguments to Python script

When writing Python utility scripts, you may wish to make your script as flexible as possible so that people can use it in different ways. One approach is to parameterize some of the variables originally hard-coded in the script, and pass them as arguments to your script. If you have only 1 or 2 arguments, you may find sys.argv good enough, since you can easily access the arguments by index from the argv list. The limitations are also obvious: it becomes difficult to manage when there are more arguments, some mandatory and some optional, and you cannot specify the acceptable data type or add a proper description for each argument.

In this article, we will discuss some tips for the argparse package, which provides an easier way to manage your input arguments.

To get started, import this package into your script, and try to run it with some sample code like below:

import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--foo', help='foo help')
args = parser.parse_args()
print(args)

Customize your prefix_chars

Most of the time you would see people use "-" before the argument name. You can change this default behavior to support more prefix characters, such as + or /. To do that, you can specify them in prefix_chars when initializing the argument parser, for instance:

parser = argparse.ArgumentParser(prefix_chars='-+/', description="This is to demonstrate multiple prefix characters")
parser.add_argument("+a", "++add")
parser.add_argument("-s", "--sub")
parser.add_argument("/d", "//dir")
args = parser.parse_args()
print(args)

When you save the above as an argumentparser.py file and call it with the below input arguments, you shall see all the arguments parsed correctly as expected:

>>python argumentparser.py +a 1 -s 2 /d python
Namespace(add='1', dir='python', sub='2')

Do take note that if your argument name contains the prefix character "-", the "-" is replaced with "_" in the parsed attribute name. For example, an argument named read-only would become read_only, and you shall use args.read_only to reference its value.
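A small sketch to illustrate; the --read-only flag here is just a made-up example argument:

parser = argparse.ArgumentParser()
parser.add_argument("--read-only", action="store_true")
args = parser.parse_args(["--read-only"])
# the dash in the argument name becomes an underscore in the attribute name
print(args.read_only)   # True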

Argument data type

When you add new arguments, the default data type is always string, so whatever value follows the argument name will be converted into a string. The argument parser supports all immutable data types, so you can specify a data type and let the argument parser validate whether the correct data type has been passed in. E.g.:

parser.add_argument("-c", "--count", type=int)

You shall see the below validation error if incorrect data type has been passed in:

>>python argumentparser.py -c 1.5
usage: argumentparser.py [-h] [-c COUNT]
argumentparser.py: error: argument -c/--count: invalid int value: '1.5'

Various argument actions

The action keyword in add_argument allows you to specify how you want to handle the arguments when they are passed into the script. Some of the commonly used actions are:

  • store – the default behavior
  • store_const – works together with the const keyword to store its value
  • store_true or store_false – sets the argument value to True or False
  • append – allows the same argument to appear multiple times and stores the argument values into a list
  • append_const – same as append, but it stores the value from the const keyword
  • count – counts how many times the argument appears

Below are some examples:

parser.add_argument('-a', '--auto', action="store_true", help="to run automatically")
parser.add_argument("-k", "--kelvin",
                        action="store_const",
                        const=273.15,
                        help="The constant to convert Celsius to Kelvin temperature")

parser.add_argument("-t", "--temperature",
                        type=float,
                        action="append",
                        default=[],
                        help="Celsius temperature to be used in %(prog)s")

parser.add_argument('--age', dest='criteria', action='append_const', const=18)
parser.add_argument('--gender',dest='criteria', action='append_const', const="male")
parser.add_argument("-c", "--count", action="count")

When you run in the command line, you shall see all these arguments are parsed correctly and stored into the respective variables:

>>python argumentparser.py -k -t 35.1 -t 37.5 --age --gender -cc -a
Namespace(auto=True, count=2, criteria=[18, 'male'], kelvin=273.15, temperature=[35.1, 37.5])

You can also define your own behavior by extending the argparse.Action class and passing your subclass to the action keyword.
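As a rough sketch, a custom action can be created by subclassing argparse.Action and overriding its __call__ method; the UpperCaseAction class below is a hypothetical example that upper-cases whatever value is passed in:

class UpperCaseAction(argparse.Action):
    def __call__(self, parser, namespace, values, option_string=None):
        # store the upper-cased value under the argument's destination name
        setattr(namespace, self.dest, values.upper())

parser = argparse.ArgumentParser()
parser.add_argument("-e", "--env", action=UpperCaseAction)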

Use action="append" or nargs="+"?

If you want to collect a list of values from a particular input argument, you have two options:

  • specify action = “append”
  • specify the nargs=”+”

For the below code, both “amount” and “nums” will be able to store a list of values from the input:

parser.add_argument("-a", "--amount",
                        type=float,
                        action="append")
parser.add_argument("-n", "--nums", nargs="+")

The only difference is that for the "append" action, you need to repeat the argument name whenever you add extra values, while for "nargs" you just put all the space-separated values after the argument name. E.g.:

>>python argumentparser.py -a 1 -a 2 -n 3 4
Namespace(amount=[1.0, 2.0], nums=['3', '4'])

You may notice that if you have any argument with nargs="+", it is better to always supply it after all the positional arguments, as the argument parser would otherwise take your positional argument as part of the previous argument (see the example in the next tip).

Mixing of positional and optional arguments

When there are no prefix characters used in the argument name, the argument parser will treat it as a positional argument. For instance:

parser.add_argument("caller", help="The process that invoke this script")
parser.add_argument("-c", "--count")

When you check the help for this script, you shall see caller is taken as positional argument.

>>python argumentparser.py -h
usage: argumentparser.py [-h] [-c COUNT] caller

positional arguments:
  caller                The process that invoke this script

optional arguments:
  -h, --help            show this help message and exit
  -c COUNT, --count COUNT

Positional arguments are considered mandatory, so Python will throw an error if they are not specified when calling the script. You can put a positional argument at any place in your input argument stream. E.g.:

>>python argumentparser.py -c 2 "cmd.exe"
>>python argumentparser.py "cmd.exe" -c 2
Namespace(caller='cmd.exe', count=2)

Python is smart enough to interpret and assign the values to the correct variables unless there is some ambiguity when interpreting your input arguments, e.g. if you use nargs to indicate that multiple argument values can be passed in:

parser.add_argument("-c", "--count", nargs='+')

And putting your positional argument behind this argument will cause an error, because all the values behind "-c" are taken as values for "count":

>>python argumentparser.py -c 1 3 "cmd.exe"
usage: argumentparser.py [-h] [-c COUNT [COUNT ...]] caller
argumentparser.py: error: the following arguments are required: caller

Difference between const vs default

The const keyword usually works together with the store_const or append_const action to store the value from the const keyword when the argument appears. If the argument is not supplied, the argument variable will be set to None. Consider the below two arguments:

parser.add_argument("-k", "--kelvin",
                        action="store_const",
                        const=273.15,
                        help="The constant to convert celsius to Kelvin temperature")
parser.add_argument("-c", "--count", default=0)

If you run with the below input arguments, you shall see results similar to below:

>>python argumentparser.py -k
Namespace(count=0, kelvin=273.15)
>>python argumentparser.py -c 1
Namespace(count='1', kelvin=None)
>>python argumentparser.py -k 270
usage: argumentparser.py [-h] [-k] [-c COUNT]
argumentparser.py: error: unrecognized arguments: 270

So with the const keyword, you basically cannot specify any other value. But you can still add a default, so that when the argument is not supplied, the variable is set to the default value rather than None.

Mandatory optional argument?

If you would like your optional argument to be mandatory (although it sounds a bit weird), you can specify the required option as True in the add_argument method, e.g.:

parser.add_argument("--data-type", required=True)

With required set to True, even if you have specified the default option, Python will still raise an error saying the argument --data-type is required.

Ignore case in choice option

Imagine you are implementing some automation script to be triggered in various modes, and you would like to limit the options accepted for this mode argument. You can specify a list of values in the choices keyword when adding the argument:

parser.add_argument('-m','--mode', choices=['AUTO','SCHEDULER','SEMI-AUTO'])

But you may notice one problem: when you specify "auto" or "Auto", you would see the below error message:

>>python argumentparser.py -m "auto"
usage: argumentparser.py [-h] [-m {AUTO,SCHEDULER,SEMI-AUTO}]
argumentparser.py: error: argument -m/--mode: invalid choice: 'auto' (choose from 'AUTO', 'SCHEDULER', 'SEMI-AUTO')

By default, the argument parser compares the values in a case-sensitive manner. To ignore case, you can specify a type keyword to transform the input values into upper or lower case:

parser.add_argument('-m','--mode', choices=['AUTO','SCHEDULER','SEMI-AUTO'], type=str.upper)

Conflicting options

Sometimes defining mutually exclusive arguments can be very useful when you do not wish two or more options to be used at the same time. The argparse package also provides an easy way to group these options with the necessary validation on the input arguments. For instance, you can group the "auto" and "on-demand" modes into a mutually exclusive group, so that only one mode can be activated at a time:

mode_group = parser.add_mutually_exclusive_group()
mode_group.add_argument('-a', '--auto', action="store_true", help="to run automatically")
mode_group.add_argument('-d', '--on-demand', action="store_true", help="to run on demand")

If both arguments are supplied, you would see the below error message:

>>python argumentparser.py -d -a
usage: argumentparser.py [-h] [-a | -d]
argumentparser.py: error: argument -a/--auto: not allowed with argument -d/--on-demand

Conclusion

The argparse package is super useful when you need to write a script to be executed from the command line. In this article, we have reviewed some tips that might help you extend your understanding of the different use cases for the individual options argparse provides. If you have a more complicated use case, you may want to read further in the official documentation, for example on sub-commands and file types.


Python cache – the must read tips for code performance

Introduction

Most of us have experienced scenarios where we need to implement some computationally expensive logic such as recursive functions, or need to read from I/O or the network multiple times. These functions typically require more resources and longer CPU time, and can eventually cause performance issues if handled without care. For such cases, you shall always pay special attention to them once you have completed all the functional requirements, as the additional costs in resources and time may eventually lead to user experience issues. In this article, I will be sharing how we can make use of a cache mechanism (aka memoization) to improve code performance.

Prerequisites:

To follow the examples below, you will need to have the requests package installed in your working environment. You may use the below pip command to install it:

pip install requests

With this ready, let’s dive into the problem we are going to solve today.

As I mentioned before, computationally expensive logic such as recursive functions or reading from I/O or the network usually has a significant impact on the runtime, and is always a target for optimization. Let me illustrate with a specific example: assume we need to call some external API to get the rates:

import requests
import json

def inquire_rate_online(dimension):
    result = requests.get(f"https://postman-echo.com/get?dim={dimension}")
    if result.status_code == requests.codes.OK:
        data = result.json()
        return data["args"]["dim"]
    return ''

This function makes a call through the network and returns the result (for demo purposes, this API call just echoes back the input as the result). If you provide this as a service to everybody, there is a high chance that different people will inquire the rate with the same dimension value. For this case, you may wish to have the result stored somewhere after the first person has inquired, so that later you can just return this result for subsequent inquiries rather than making the API call again. With this sort of caching mechanism, it should speed up your code.

Implement cache with global dictionary

For the above example, the most straightforward way to implement a cache is to store the arguments and results in a dictionary, and every time check this dictionary to see if the key exists before calling the external API. We can implement this logic in a separate function as per below:

cached_rate = {}
def cached_inquire(dim):
    if dim in cached_rate:
        print(f"cached value: {cached_rate[dim]}")
        return cached_rate[dim]
    cached_rate[dim]= inquire_rate_online(dim)
    print(f"result from online : {cached_rate[dim]}")
    return cached_rate[dim]

With this code, the previous key and result are cached in the dictionary, so that subsequent calls are returned directly from a dictionary lookup rather than an external API call. This should dramatically improve your code performance, since reading from a dictionary is much faster than making an API call over the network.

You can quickly test it from Jupyter Notebook with the %time magic:

%time cached_inquire(1)

The first time you call it, you will see the time taken is over 1 second (depending on the network conditions):

result from online : 1
Wall time: 1.22 s

When calling it again with the same argument, we should expect our cached result to kick in:

%time cached_inquire(1)

You can see the total time taken dropped to 997 microseconds for this call, which is over 1200 times faster than previously:

cached value: 1
Wall time: 997 µs

With this additional global dictionary, we can see a big improvement in performance. But some people may be concerned about the additional memory used to hold these values in a dictionary, especially if the result is a huge object such as an image file or an array. Python has a separate module called weakref which addresses this problem.

Implement cache with weakref

Python's weakref module allows you to create weak references to objects, so that garbage collection is free to destroy those objects whenever needed in order to reuse their memory.

For demonstration purpose, let’s modify our earlier code to return a Rate class instance as the inquiry result:

class Rate():
    def __init__(self, dim, amount):
        self.dim = dim
        self.amount = amount
    def __str__(self):
        return f"{self.dim} , {self.amount}"

def inquire_rate_online(dimension):
    result = requests.get(f"https://postman-echo.com/get?dim={dimension}")
    if result.status_code == requests.codes.OK:
        data = result.json()
        return Rate(float(data["args"]["dim"]), float(data["args"]["dim"]))
    return Rate(0.0,0.0)

And instead of a normal Python dictionary, we will use a WeakValueDictionary to hold weak references to the returned objects. Below is the updated code:

import weakref

wkrf_cached_rate = weakref.WeakValueDictionary()
def wkrf_cached_inquire(dim):
    if dim in wkrf_cached_rate:
        print(f"cached value: {wkrf_cached_rate[dim]}")
        return wkrf_cached_rate[dim]

    result = inquire_rate_online(dim)
    print(f"result from online : {result}")
    wkrf_cached_rate[dim] = result
    return wkrf_cached_rate[dim]

With the above changes, if you run the wkrf_cached_inquire two times, you shall see the significant improvement on the performance:

[Screenshot: timing results for wkrf_cached_inquire]

The dictionary does not hold the Rate instance itself, but rather a weak reference to it, so you do not have to worry about the extra memory used: garbage collection will reclaim the object when needed, and your dictionary will be automatically updated with that particular entry removed. The program will then simply call the external API again, like the first time.

If you stop reading here, you will miss the most important part of this article, because what we have gone through above is good but not perfect, due to the below issues:

  • In the example, we only have 1 argument for the inquire_rate_online function; things get tedious if you have more arguments, since all of them have to be stored as the key for the dictionary. In that case, re-implementing the caching as a decorator function would probably be easier
  • Sometimes you do not really want to let garbage collection determine which values are cached longer than others; rather, you want your cache to follow a certain logic, for instance keeping entries based on how recently they were used, aka least recently used (LRU)

If the least recently used (LRU) cache mechanism makes sense for your use case, you shall consider making use of the lru_cache decorator from the functools module, which will save you a lot of effort in reinventing the wheel.

Cache with lru_cache

The lru_cache decorator accepts two arguments:

  • maxsize to limit the size of the cache; when it is None, the cache can grow without bound
  • typed – when set to True, arguments of different types are cached separately, e.g. inquire_rate_online(1) and inquire_rate_online(1.0) would be cached as different entries

With the understanding of the lru_cache, let’s decorate our inquire_rate_online function to have the cache capability:

from functools import lru_cache

@lru_cache(maxsize=None)
def inquire_rate_online(dimension):
    result = requests.get(f"https://postman-echo.com/get?dim={dimension}")
    if result.status_code == requests.codes.OK:
        data = result.json()
        return Rate(float(data["args"]["dim"]), float(data["args"]["dim"]))
    return Rate(0.0,0.0)

If we re-run our inquire_rate_online twice, you can see the same effect as previously in terms of the performance improvement:

[Screenshot: timing results with lru_cache]

With this decorator, you can also see how the cache is being used. The hits field shows the number of calls that have been returned from cached results:

inquire_rate_online.cache_info()
#CacheInfo(hits=1, misses=1, maxsize=None, currsize=1)

Or you can manually clear all the cache to reset the hits and misses to 0:

inquire_rate_online.cache_clear()

Limitations:

Let’s also talk about the limitations of the solutions we discussed above:

  • The cache mechanism works best for deterministic functions, meaning that given the same set of inputs, they always return the same results. You would not benefit much if you try to cache the result of a nondeterministic function, e.g.:
import random
def random_x(x):
    return x*random.randint(1,1000)
  • For keyword arguments, if you swap the position of the keywords, the two calls will be cached as separate entries
  • It only works when the argument values are hashable, such as immutable built-in types (see the small sketch after this list)
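For instance, a minimal sketch of the last limitation: passing a list (an unhashable type) to a function decorated with lru_cache raises a TypeError, because the arguments cannot be used as a cache key:

from functools import lru_cache

@lru_cache(maxsize=None)
def total(amounts):
    return sum(amounts)

total([1, 2, 3])
# TypeError: unhashable type: 'list'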

Conclusion

In this article, we have discussed different ways of creating a cache to improve code performance when you have computationally expensive operations or heavy I/O or network reads. Although the lru_cache decorator provides an easy and clean solution for creating a cache, it is still better to understand the underlying implementation of caching before simply taking it and using it.

We also discussed the limitations of these solutions that you may need to take note of before implementing them. Nevertheless, these methods should still help you improve your code performance in a lot of scenarios.