
Python comprehensions for list, set and dictionary

Introduction

A Python comprehension is a compact set of looping and filtering instructions for evaluating an expression and producing a sequence as output. It is commonly used to construct list, set or dictionary objects, in which case it is known as a list comprehension, set comprehension or dictionary comprehension respectively. Compared to using map, filter or nested loops to generate the same result, a comprehension has more concise syntax and improved readability. In this article, I will be explaining these three types of comprehensions with some practical examples.

Python comprehension basic syntax

You may have seen people using list/set/dict comprehension in their code, but if you have not yet checked the Python documentation, below is the syntax for Python comprehension.

Assignment Expression for target in iterable [if condition]

It requires a single assignment expression with at least one for statement, followed by zero or more for or if statements.

With this basic understanding, let’s dive into the examples.

List comprehension

A list comprehension uses for and if statements to construct a list literal. The new list can be derived from any iterable such as a string, list, set or dictionary. For instance, if we have a list like below:

words = [
    "Serendipity",
    "Petrichor",
    "Supine",
    "Solitude",
    "Aurora",
    "Idyllic",
    "Clinomania",
    "Pluviophile",
    "Euphoria",
    "Sequoia"]

Single loop/if statement

You can use list comprehension to derive a new list which only keeps the elements with length less than 8 from the original list:

short_words = [word for word in words if len(word) < 8 ]

If you examine the short_words, you shall see only the short words (length less than 8) were selected to form a new list:

['Supine', 'Aurora', 'Idyllic', 'Sequoia']

Multiple if statements

As described earlier in the syntax, you can have multiple if conditions to filter the elements:

short_s_words = [word for word in words if len(word) < 8 if word.startswith("S") ]
#short_s_words = [word for word in words if len(word) < 8 and word.startswith("S") ]

The above two would generate the same result as per below:

['Supine', 'Sequoia']

Similarly, you can also use or in the if statement:

short_or_s_words = [word for word in words if len(word) < 8 or word.startswith("S") ]

You shall see the below result for the short_or_s_words variable:

['Serendipity', 'Supine', 'Solitude', 'Aurora', 'Idyllic', 'Sequoia']

Multiple loop/if statements

Sometimes you may have a nested data structure and you would like to flatten it. For instance, to transform a nested list into a flat list, you can make use of a list comprehension with multiple for statements as per below:

lat_long = [[1.291173,103.810535], [1.285387,103.846082], [1.285803,103.845392]]
[x for pos in lat_long for x in pos]

Python will evaluate these expressions from left to right until the innermost block is reached. You shall see the nested list has been transformed into a flat list:

[1.291173, 103.810535, 1.285387, 103.846082, 1.285803, 103.845392]

And similarly, if you have multiple sequences to be iterated through, you can have multiple for statements in your comprehension or use the zip function, depending on what kind of result you want to achieve:

[(word, num) for word in words if word.startswith("S") for num in range(4) if num%2 == 0]

The above code would generate the output as per below:

[('Serendipity', 0),
 ('Serendipity', 2),
 ('Supine', 0),
 ('Supine', 2),
 ('Solitude', 0),
 ('Solitude', 2),
 ('Sequoia', 0),
 ('Sequoia', 2)]

If you use zip as per below, it will generate a different result.

[(word, num) for word, num in zip(words, range(len(words))) if word.startswith("S") and num%2 == 0]
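Since zip pairs each word with its own index, only the words starting with "S" that sit at an even index are kept, so you shall see:

[('Serendipity', 0), ('Supine', 2)]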

Another practical example would be using list comprehension to return a particular type of file from the current folder and its subfolders. For instance, the below code would list out all the ipynb files from the current folder and its subfolders, excluding the checkpoints folder:

import os

[os.path.join(d[0], f) for d in os.walk(".") if not ".ipynb_checkpoints" in d[0]
             for f in d[2] if f.endswith(".ipynb")]

Generate tuples from list comprehension

As you can see from the above examples, list comprehension supports generating a list of tuples, but do take note that you have to use parentheses, e.g. (word, len(word)), in the expression to indicate that the expected output is a tuple, otherwise there will be a syntax error:

[(word, len(word)) for word in words]

Set comprehension

Similar to list comprehension, the set comprehension uses the same syntax but constructs a set literal. For instance:

words_set = set(words)
short_words_set = {word for word in words_set if len(word) < 8}

The only difference between list comprehension and set comprehension is that the square brackets "[]" are changed to curly braces "{}". And you shall see the same result as the previous example except the data type now is a set:

{'Aurora', 'Idyllic', 'Sequoia', 'Supine'}

And same as list comprehension, any iterable can be used in the set comprehension to derive a new set. So using the list directly as below will also produce the same result as the above example.

short_words_set = {word for word in words if len(word) < 8}

Due to the nature of the set data structure, you shall expect the duplicate values to be removed when forming a new set with set comprehension.
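For instance, a quick sketch using the word lengths shows the duplicates collapsing into single entries:

word_lengths = {len(word) for word in words}
#{6, 7, 8, 9, 10, 11} - the duplicate lengths 6, 7, 8 and 11 appear only once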

Dictionary comprehension

With enough explanation above, I think we can jump directly into the examples, since everything is the same as list and set comprehension except the data type.

Below is an example:

dict_words = {word: len(word) for word in words}

It produces a new dictionary as per below:

{'Serendipity': 11,
 'Petrichor': 9,
 'Supine': 6,
 'Solitude': 8,
 'Aurora': 6,
 'Idyllic': 7,
 'Clinomania': 10,
 'Pluviophile': 11,
 'Euphoria': 8,
 'Sequoia': 7}

And similarly, you can do some filtering with if statements:

s_words_dict = {word: length for word, length in dict_words.items() if word.startswith("S")}

You can see only the keys starting with "S" were selected to form a new dictionary:

{'Serendipity': 11, 'Supine': 6, 'Solitude': 8, 'Sequoia': 7}

You can check another usage of dictionary comprehension from this post – How to swap key and value in a python dictionary

Limitations and constraints

With all the above examples, you may find comprehension makes our code more concise and clearer compared to using map and filter:

list(map(lambda x: x, filter(lambda word: len(word) < 8, words)))

But do bear in mind not to overuse it. Especially if you have more than two loop/if statements, you shall consider moving the logic into a function (see the sketch below) rather than putting everything into a single line, which causes readability issues.
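For instance, a minimal sketch of how the earlier file-listing comprehension could be refactored into a small helper function (the function name find_notebooks is just an illustrative choice):

import os

def find_notebooks(root="."):
    # walk the directory tree and collect .ipynb files, skipping checkpoint folders
    for folder, _, files in os.walk(root):
        if ".ipynb_checkpoints" in folder:
            continue
        for f in files:
            if f.endswith(".ipynb"):
                yield os.path.join(folder, f)

notebooks = list(find_notebooks())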

The Python comprehension is designed to be simple and only supports the for and if statements, so you will not be able to put more complicated logic inside a single comprehension.

Finally, if you have a large set of data, you shall avoid using comprehension as it may exhaust all the memory and cause your program to crash. An alternative is to use a generator expression, which has similar syntax but produces a generator for later use. For instance:

w_generator = ((word, length) for word, length in dict_words.items() if word.startswith("S"))

It returns a generator and you can consume the items one by one:

for x in w_generator:
    print(x)

You can see the same result would be produced:

('Serendipity', 11)
('Supine', 6)
('Solitude', 8)
('Sequoia', 7)

Conclusion

In this article, we have reviewed the basic syntax of Python comprehension with some practical examples for list comprehension, set comprehension and dictionary comprehension. Although it is very convenient to use Python comprehension, you will still need to think about the readability/clarity of your code to other people, as well as the potential memory issues when you are working on a huge data set.


Python datetime – the 9 tips you shall know

Introduction

Dealing with date and time is quite common whenever you are writing Python scripts. For instance, the simplest use cases would be logging some events with a timestamp, or saving a file with date and time info in the file name. It can be challenging when you have more complicated scenarios such as handling time zones, daylight saving and recurrences. The built-in Python datetime module is capable of handling most of the date and time operations, and there are third party libraries that can help you to easily manage the time zone and daylight saving challenges. In this article, we will be discussing some tips for using the Python datetime module as well as the third party package dateutil.

Prerequisite

If you do not have dateutil installed yet, you shall install the latest version to your working environment. Below is the pip command to install the package:

pip install python-dateutil

Let’s get started!

Various ways to get current date and time

The number one use case for a Python datetime object is to get the current date or time. There are plenty of ways to get the current date and time from the Python datetime module, for instance:

>>>from datetime import datetime, timezone
>>>import time

#Local timezone
>>>datetime.now()
datetime.datetime(2020, 10, 24, 21, 31, 11, 761666)
>>>datetime.today()
datetime.datetime(2020, 10, 24, 21, 31, 12, 139719)

>>>datetime.fromtimestamp(time.time())
datetime.datetime(2020, 10, 24, 21, 31, 12, 559183)

#Not suggested
>>>datetime.fromtimestamp(time.mktime(time.localtime()))
datetime.datetime(2020, 10, 24, 21, 33, 5)

#UTC timezone
>>>datetime.now(timezone.utc)
datetime.datetime(2020, 10, 24, 13, 31, 13, 443442, tzinfo=datetime.timezone.utc)
>>>datetime.utcnow()
datetime.datetime(2020, 10, 24, 13, 31, 14, 240517)

Most of the above methods return a datetime object in the local machine time, while the last two get the date and time in the UTC time zone (note that utcnow() returns a naive object, as you can see from the missing tzinfo).

If you only need the date info, you can discard the time portion by using the date() method as per below:

>>>datetime.now().date()
datetime.date(2020, 10, 24)

Get year, month, day and time from Python datetime

From the datetime object, you can easily get the individual components such as year, month, day, hour etc. Below examples show you how to extract the date and time components from the datetime object, as well as the weekday and week number information:

>>>TODAY = datetime.today()
>>>TODAY.year, TODAY.month, TODAY.day, TODAY.hour, TODAY.minute, TODAY.second, TODAY.microsecond
(2020, 10, 24, 21, 36, 35, 842689)

#Monday is 0 and Sunday is 6
>>>TODAY.weekday()
5
#Monday is 1 and Sunday is 7
>>>TODAY.isoweekday()
6
#Return year, weekno, and weekday
>>>TODAY.isocalendar()
(2020, 43, 6)

Take note of the start day when you get the weekday as a number: weekday() returns 0 for Monday, while isoweekday() returns 1 for Monday. Some programming languages use 0 for Sunday; in this case you can use the %w format code to get the weekday number where Sunday is 0.
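A quick sketch reusing the TODAY value from above (2020-10-24 is a Saturday, so %w returns '6'):

>>>TODAY.strftime("%w")
'6'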

Date plus or minus X days

Very often you will need to do some arithmetic calculation on the dates, such as calculating the number of days backward or forward from the current date. To do that, you will need to use the timedelta class. Below is the syntax to create a timedelta object; you can specify the number of weeks, days, hours, minutes etc. for initialization:

>>>from datetime import timedelta
>>>timedelta(days=1, seconds=50, microseconds=1000, milliseconds=1000, minutes=10, hours=6, weeks=1)
datetime.timedelta(days=8, seconds=22251, microseconds=1000)

All the arguments passed to timedelta will be eventually converted into days, seconds and microseconds.

So to calculate today plus 1 day, you can specify the timedelta with 1 day and add it up to the current date:

>>>tomorrow = datetime.today().date() + timedelta(days=1)
>>>tomorrow
datetime.date(2020, 10, 25)

Similarly, calculating the date backwards can be achieved by specifying the arguments as negative numbers:

>>>yesterday = datetime.today().date() + timedelta(days=-1)
>>>yesterday
datetime.date(2020, 10, 23)

When calculating the difference between two dates, it will also return a timedelta object:

>>>tomorrow - yesterday
datetime.timedelta(days=2)
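If you only need the number of days as a plain integer, you can read the days attribute of the resulting timedelta (a small sketch based on the variables above):

>>>(tomorrow - yesterday).days
2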

Get the first day of the month

With the replace() method, you can replace the year, month or day of the current date and return a new date. The most commonly used scenario would be getting the first day of the month based on current date, e.g.:

>>>datetime.today().date().replace(day=1)
datetime.date(2020, 10, 1)

Format date with strftime and strptime

There are many scenarios where you need to convert a string into a date or format a date object into a string. You can easily convert a date into string format with the strftime method, for instance:

>>>datetime.now().strftime("%Y-%b-%d %H:%M:%S")
'2020-Oct-25 20:35:54'

And similarly, you can use strptime to convert a string into a datetime object:

>>>datetime.strptime("Oct 25 2020 08:10:00", "%b %d %Y %H:%M:%S")
datetime.datetime(2020, 10, 25, 8, 10)

You can check here for the full list of format codes supported by strftime and strptime. And do take note that strptime can be much slower than you expect if you are using it in a loop over a large set of data. For such a case, you may consider directly using datetime(year, month, day) to construct the datetime object.
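For instance, if the input strings all follow a fixed layout like "2020-10-25", a minimal sketch of the manual construction could look like the below (the slicing positions are assumptions based on that layout):

>>>s = "2020-10-25"
>>>datetime(int(s[0:4]), int(s[5:7]), int(s[8:10]))
datetime.datetime(2020, 10, 25, 0, 0)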

Create time zone aware date

Most of the methods in the Python datetime module return time zone naive objects (which means they do not include any time zone info). In case you need time zone aware objects, you can specify the time zone info when initializing a date/time object, for instance:

>>>singapore_tz = timezone(timedelta(hours=8), name="SGT")
>>>sg_time_now = datetime.now(tz=singapore_tz)
>>>sg_time_now
datetime.datetime(2020, 10, 24, 22, 31, 6, 554991, tzinfo=datetime.timezone(datetime.timedelta(seconds=28800), 'SGT'))

If you use 3rd party libraries like pytz or dateutil, you can easily get the time zone info by supplying IANA time zone database name or Windows time zone names. Below is the example for dateutil:

>>>import dateutil

#time zone database name from IANA
>>>sh_tz = dateutil.tz.gettz('Asia/Shanghai')
>>>datetime(2020, 10, 24, 22, tzinfo = sh_tz)
datetime.datetime(2020, 10, 24, 22, 0, tzinfo=tzfile('PRC'))

#windows time zone names
>>>cn_tz = dateutil.tz.gettz('China Standard Time')
>>>datetime(2020, 10, 24, 22, tzinfo = cn_tz)
datetime.datetime(2020, 10, 24, 22, 0, tzinfo=tzwin('China Standard Time'))

With the time zone database, you do not need to worry about the offset hours, and you only need to provide the name to get the correct date and time in the respective time zone.

Get a date by relative period

If you use timedelta to get a date from the current date plus a relative period such as 1 year or 1 month, you may sometimes run into problems with leap years. For instance, the below returns Apr 30 because 2020 is a leap year, so the number of days shall be 366 rather than 365.

>>>datetime(2019, 5, 1) + timedelta(days=365)
datetime.datetime(2020, 4, 30, 0, 0)

The simple way to get the correct result is to use relativedelta from the dateutil package, e.g.:

>>>from dateutil.relativedelta import relativedelta

>>>datetime(2019, 5, 1) + relativedelta(years=1)
datetime.datetime(2020, 5, 1, 0, 0)

You can also specify the other arguments such as the months, days and hours:

>>>datetime.today() + relativedelta(years=1, months=1, days=10, hours=10)
datetime.datetime(2021, 12, 5, 8, 49, 31, 386813)

To get the date of the next Saturday from the current date, you can use:

>>>import calendar
>>>datetime.today() + relativedelta(weekday=calendar.SATURDAY)
datetime.datetime(2020, 10, 24, 15, 16, 10, 502191)

Take note that if you are running it on a Saturday before 23:59:59, it will just return the current date, so it actually returns the current or next upcoming Saturday.
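If you need strictly the following Saturday even when today is already a Saturday, one common dateutil idiom is to combine a one-day shift with the weekday argument (a sketch, not part of the original example):

>>>from dateutil.relativedelta import relativedelta, SA
>>>datetime.today() + relativedelta(days=+1, weekday=SA(+1))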

List out all the weekdays

In case you need to get all the weekdays starting from a particular date, you can make use of the recurrence rules from dateutil package.

For instance, the below rrule specifies to recur on daily basis for Mon to Fri with a start and end date:

>>>from dateutil.rrule import rrule, DAILY, MO, TU, WE, TH, FR
>>>from dateutil.parser import parse

>>>list(rrule(DAILY, interval=1, byweekday=[MO, TU, WE, TH, FR], dtstart=datetime.now().date(), until=datetime(2020, 11, 2)))

[datetime.datetime(2020, 10, 26, 0, 0),
 datetime.datetime(2020, 10, 27, 0, 0),
 datetime.datetime(2020, 10, 28, 0, 0),
 datetime.datetime(2020, 10, 29, 0, 0),
 datetime.datetime(2020, 10, 30, 0, 0),
 datetime.datetime(2020, 11, 2, 0, 0)]

The frequency and interval arguments determine the frequency of the recurrence, while byweekday and dtstart further constrain which dates are selected.

Besides the weekday argument, you can also specify by year, month, hour, minute etc. You can check here for all the available arguments supported for instantiating the rrule object.

As another example, the below code returns a list of dates recurring at 9:15am every other day:

>>>list(rrule(DAILY, interval=2, byminute=15, count=4, dtstart=parse("20201024T090000")))
[datetime.datetime(2020, 10, 24, 9, 15),
 datetime.datetime(2020, 10, 26, 9, 15),
 datetime.datetime(2020, 10, 28, 9, 15),
 datetime.datetime(2020, 10, 30, 9, 15)]

Get a list of business days

Sometimes you would need to exclude the public holidays to get only the business days. To do so, you may first get a list of holidays from another 3rd party library like holidays, or simply put all holidays into some config file, and then exclude these dates from rrule. For instance:

holidays = [
    datetime(2020, 7, 10,),
    datetime(2020, 7, 31,),
    datetime(2020, 8, 10,)
]
from dateutil.rrule import rruleset

r = rrule(DAILY, interval=1, byweekday=[MO, TU, WE, TH, FR],
   dtstart=datetime(2020, 7, 10), until=datetime(2020, 8, 1))

rs = rruleset()
rs.rrule(r)

for d in holidays:
    rs.exdate(d)

print(list(rs))

You can see the public holidays have been excluded from the return results:

python datetime - dateutil output

Conclusion

Working with dates can sometimes be tough, especially when you need to manipulate them in different time zones or consider daylight saving. Luckily, with Python datetime and other 3rd party libraries like dateutil, things are getting easier. But you will still need to be very careful when handling dates with time zones or setting up recurrence rules in local time.

Thanks for reading, and you can find other Python related topics from here.


10 tips for passing arguments to Python script

When writing Python utility scripts, you may wish to make your script as flexible as possible so that people can use it in different ways. One approach is to parameterize some of the variables originally hard coded in the script, and pass them as arguments to your script. If you have only 1 or 2 arguments, you may find sys.argv is good enough since you can easily access the arguments by their index in the argv list. The limitation is also obvious: it becomes difficult to manage when there are more arguments, some mandatory and some optional, and you cannot specify the acceptable data type or add a proper description for each argument.

In this article, we will be discussing some tips for the argparse package, which provides an easier way to manage your input arguments.

To get started, you shall import this package into your script, and try to run with some sample code like below:

import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--foo', help='foo help')
args = parser.parse_args()
print(args)
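If you save the above as argumentparser.py (the file name used throughout this article) and run it, you shall see the parsed argument stored in a Namespace object:

>>python argumentparser.py --foo bar
Namespace(foo='bar')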

Customize your prefix_chars

Most of the time you would see people use "-" before the argument name. You can change this default behaviour to support more prefix characters, such as + or \ etc. To do that, you can specify them in prefix_chars when initializing the argument parser, for instance:

parser = argparse.ArgumentParser(prefix_chars='-+/', description="This is to demonstrate multiple prefix characters")
parser.add_argument("+a", "++add")
parser.add_argument("-s", "--sub")
parser.add_argument("/d", "//dir")
args = parser.parse_args()
print(args)

When you save the above as the argumentparser.py file and call it with the below input arguments, you shall see all the arguments are parsed correctly as expected:

>>python argumentparser.py +a 1 -s 2 /d python
Namespace(add='1', dir='python', sub='2')

Do take note that if your argument name contains the prefix character "-", the "-" will be replaced with "_" in the parsed attribute name. For example, the argument name read-only becomes read_only, and you shall use args.read_only to reference the value.

Argument data type

When you are adding new arguments, the default data type is always string, so whatever value follows the argument name will be converted into a string. The type keyword accepts any callable that converts the input string (for example int or float), so you can specify a data type and let the argument parser validate whether the correct data type has been passed in. E.g.:

parser.add_argument("-c", "--count", type=int)

You shall see the below validation error if incorrect data type has been passed in:

>>python argumentparser.py -c 1.5
usage: argumentparser.py [-h] [-c COUNT]
argumentparser.py: error: argument -c/--count: invalid int value: '1.5'
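Since the type keyword accepts any callable, you can also plug in your own converter for extra validation. Below is a minimal sketch (the positive_int function is an illustrative assumption, not part of the original example):

def positive_int(value):
    # convert the raw string and reject non-positive numbers
    n = int(value)
    if n <= 0:
        raise argparse.ArgumentTypeError(f"{value} is not a positive integer")
    return n

parser.add_argument("-c", "--count", type=positive_int)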

Various argument actions

The action keyword in add_argument allows you to specify how you want to handle the arguments when they are passed into the script. Some of the commonly used actions are:

  • store – the default behavior
  • store_const – works together with the const keyword to store its value
  • store_true or store_false – set the argument value to True or False
  • append – allows the same argument to appear multiple times and stores the argument values into a list
  • append_const – same as append, but it stores the value from the const keyword
  • count – counts how many times the argument appears

Below are some examples:

parser.add_argument('-a', '--auto', action="store_true", help="to run automatically")
parser.add_argument("-k", "--kelvin",
                        action="store_const",
                        const=273.15,
                        help="The constant to convert Celsius to Kelvin temperature")

parser.add_argument("-t", "--temperature",
                        type=float,
                        action="append",
                        default=[],
                        help="Celsius temperature to be used in %(prog)s")

parser.add_argument('--age', dest='criteria', action='append_const', const=18)
parser.add_argument('--gender',dest='criteria', action='append_const', const="male")
parser.add_argument("-c", "--count", action="count")

When you run it in the command line, you shall see all these arguments are parsed correctly and stored in the respective variables:

>>python argumentparser.py -k -t 35.1 -t 37.5 --age --gender -cc -a
Namespace(auto=True, count=2, criteria=[18, 'male'], kelvin=273.15, temperature=[35.1, 37.5])

In Python version 3.8 and later, there is also a built-in "extend" action, and you can always write your own class extending argparse.Action and pass it to the action keyword for customized handling.
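A minimal sketch of such a custom action (the class name and the upper-casing behaviour are illustrative assumptions):

class UpperAction(argparse.Action):
    def __call__(self, parser, namespace, values, option_string=None):
        # store the supplied value in upper case
        setattr(namespace, self.dest, values.upper())

parser.add_argument("-e", "--env", action=UpperAction)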

Use action="append" or nargs="+"?

If you want to collect a list of values from a particular input argument, you have two options:

  • specify action = “append”
  • specify the nargs=”+”

For the below code, both “amount” and “nums” will be able to store a list of values from the input:

parser.add_argument("-a", "--amount",
                        type=float,
                        action="append")
parser.add_argument("-n", "--nums", nargs="+")

The only difference is that, for the "append" action, you will need to repeat the argument name whenever you need to add extra values, while for "nargs", you just need to put all the space separated values after the argument name. E.g.:

>>python argumentparser.py -a 1 -a 2 -n 3 4
Namespace(amount=[1.0, 2.0], nums=['3', '4'])

You may notice that if you have any argument with nargs="+", it's better to always put it after all the positional arguments, as the argument parser would otherwise take your positional argument as part of the previous argument. (see the example in the next tip)

Mixing of positional and optional arguments

When no prefix character is used in the argument name, the argument parser will treat it as a positional argument. For instance:

parser.add_argument("caller", help="The process that invoke this script")
parser.add_argument("-c", "--count")

When you check the help for this script, you shall see caller is taken as a positional argument.

>>python argumentparser.py -h
usage: argumentparser.py [-h] [-c COUNT] caller

positional arguments:
  caller                The process that invoke this script

optional arguments:
  -h, --help            show this help message and exit
  -c COUNT, --count COUNT

Positional arguments are considered mandatory, so Python will throw an error if they are not specified when calling the script. You can put a positional argument at any position in your input argument stream. E.g.:

>>python argumentparser.py -c 2 "cmd.exe"
>>python argumentparser.py "cmd.exe" -c 2
Namespace(caller='cmd.exe', count=2)

Python is smart enough to interpret and assign the values to the correct variables unless there is some ambiguity when interpreting your input arguments, e.g. if you use nargs to indicate that multiple argument values can be passed in:

parser.add_argument("-c", "--count", nargs='+')

Then putting your positional argument behind this argument will cause an error, because all the values behind "-c" will be taken as the values for "count":

>>python argumentparser.py -c 1 3 "cmd.exe"
usage: argumentparser.py [-h] [-c COUNT [COUNT ...]] caller
argumentparser.py: error: the following arguments are required: caller

Difference between const and default

The const keyword usually works together with the store_const or append_const action to store the value from the const keyword when the argument appears. If the argument is not supplied, the argument variable will be set to None. Consider the below two arguments:

parser.add_argument("-k", "--kelvin",
                        action="store_const",
                        const=273.15,
                        help="The constant to convert celsius to Kelvin temperature")
parser.add_argument("-c", "--count", default=0)

If you run with the below input arguments, you shall see results similar to the below:

>>python argumentparser.py -k
Namespace(count=0, kelvin=273.15)
>>python argumentparser.py -c 1
Namespace(count='1', kelvin=None)
>>python argumentparser.py -k 270
usage: argumentparser.py [-h] [-k] [-c COUNT]
argumentparser.py: error: unrecognized arguments: 270

So with the const keyword, you basically cannot specify any other value for the argument. But you can still add a default value, so that when the argument is not supplied, the variable will be set to the default value rather than None.
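A minimal sketch combining const and default, reusing the kelvin argument from above (the default value of 0 is an illustrative assumption):

parser.add_argument("-k", "--kelvin",
                        action="store_const",
                        const=273.15,
                        default=0)
#without -k : Namespace(kelvin=0)
#with    -k : Namespace(kelvin=273.15)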

Mandatory optional argument?

If you would like your optional argument to be mandatory (although it sounds a bit weird), you can specify the required option as True in the add_argument method, e.g.:

parser.add_argument("--data-type", required=True)

With required set to True, even if you have specified the default option, Python will still raise an error saying that the argument --data-type is required.

Ignore case in choice option

Imagine you are implementing some automation script to be triggered in various modes, and you would like to limit the values accepted for this mode argument. You can specify a list of values in the choices keyword when adding the argument:

parser.add_argument('-m','--mode', choices=['AUTO','SCHEDULER','SEMI-AUTO'])

But you may realize one problem: when you specify "auto" or "Auto", you would see the below error message:

>>python argumentparser.py -m "auto"
usage: argumentparser.py [-h] [-m {AUTO,SCHEDULER,SEMI-AUTO}]
argumentparser.py: error: argument -m/--mode: invalid choice: 'auto' (choose from 'AUTO', 'SCHEDULER', 'SEMI-AUTO')

By default, the argument parser compares the values in a case sensitive manner. To ignore the case, you can specify a type keyword to transform the input values into upper or lower case:

parser.add_argument('-m','--mode', choices=['AUTO','SCHEDULER','SEMI-AUTO'], type=str.upper)

Conflicting options

Sometimes defining mutually exclusive arguments can be very useful when you do not wish two or more options to be used at the same time. The argparse package also provides an easy way to group these options with the necessary validation on the input arguments. For instance, you can group the "auto" and "on-demand" modes into a mutually exclusive group, so that only one mode can be activated at a time:

mode_group = parser.add_mutually_exclusive_group()
mode_group.add_argument('-a', '--auto', action="store_true", help="to run automatically")
mode_group.add_argument('-d', '--on-demand', action="store_true", help="to run on demand")

If both arguments are supplied, you would see the below error message:

>>python argumentparser.py -d -a
usage: argumentparser.py [-h] [-a | -d]
argumentparser.py: error: argument -a/--auto: not allowed with argument -d/--on-demand

Conclusion

The argparse package is super useful when you need to write scripts that are executed from the command line. In this article, we have reviewed some tips that might help you extend your understanding of the different use cases for the individual options argparse provides. If you have a more complicated use case, you may want to read further in the official documentation, such as the sub-commands and file type sections.


How to calculate date difference between rows in pandas

Problem:

You have some data with date (or numeric) columns, and you already know you can directly use the - operator to calculate the difference between columns, but now you would like to calculate the date difference between two consecutive rows.

For instance, you have some sample data for GPS tracking, and it has the start and end time at each location (latitude and longitude). You would like to calculate the time gap within each location or between two locations.

import pandas as pd
import numpy as np

df = pd.read_excel("GPS Data.xlsx")
df.head(10)

For a quick view, you can see the sample data output as per below:

pandas calculate date difference between two consecutive rows

Solutions:

  Option 1: Using Series or Data Frame diff

The data frame diff function is the most straightforward way to compare the values between the current row and the previous rows. By default, it compares the current and previous row, and you can also specify the periods argument in order to compare the current row with the row that is periods rows before it. To calculate the time gap of the start time between two consecutive rows:

df["Start Time"].diff()

You shall see the below output:

pandas calculate date difference between two consecutive rows

If you check the data type of the result, you may see it showing as dtype('<m8[ns]'); you can convert the result into integer minutes for easier reading.

In this case, you can use the below timedelta from numpy to convert the date difference into minutes:

df["Start Time"].diff().apply(lambda x: x/np.timedelta64(1, 'm')).fillna(0).astype('int64')

You shall see the below output:

pandas calculate date difference between two consecutive rows

You can also select multiple date columns as a data frame to apply the diff function.
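A minimal sketch of that, using the column names from the sample data:

df[["Start Time", "End Time"]].diff()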

  Option 2: Using Series or Data Frame shift with – operator

The shift function allows us to move the values up/down or left/right by the given number of periods, depending on the axis you specify. You can imagine it is the same as the shift cells function in Excel.

To calculate the difference across two rows, you will need to shift the subtrahend column by 1 row. Below is how to calculate the difference between the current row's End Time and the previous row's Start Time:

df["End Time"] - df["Start Time"].shift(1)

You shall see the below result:

pandas calculate date difference between two consecutive rows

If you want to calculate the difference for multiple date columns, you can use the data frame shift operation as per below:

df[["End Time", "Start Time"]] - df[["Start Time", "End Time"]].shift(1)

pandas calculate date difference between two consecutive rows

  Option 3: Using data frame sub

The data frame sub function is self-explanatory by its name. You can apply the subtraction at row level or column level by specifying the axis argument. For our case, you can use sub to subtract a particular row (for instance, the first row) from the whole data frame, which gives the elapsed minutes of each record relative to that row. Below is the code:

(df.loc[:,["Start Time", "End Time"]].sub(df.loc[0,["Start Time", "End Time"]], axis='columns')/np.timedelta64(1, "m")).astype("int64")

You can see the below output:

pandas calculate date difference between two consecutive rows

If you would like to calculate the gap between the current End Time and the previous Start Time, as in option 2, you can use the below:

df["End Time"].sub(df["Start Time"].shift(1))

It should produce the same result as option 2, where we used - with shift.

Conclusion:

Among the 3 options we discussed above, using diff is the most straightforward approach, but you may notice that it can only apply the calculation within the same column. If you would like to calculate the difference between the End Time of one row and the Start Time of an adjacent row, you will have to use sub or the - operator together with shift. One more difference between diff and sub is that sub has a fill_value argument which substitutes the missing values with a default value, so you do not need another line of code to handle the NA values.


Pandas tricks – split one row of data into multiple rows

As a data scientist or analyst, you will need to spend a lot of time wrangling data from various sources so that you can have a standard data structure for your further analysis. There are cases where you get the raw data in some sort of summary view and you need to split one row of data into multiple rows based on certain conditions in order to do grouping and matching from different perspectives. In this article, we will be discussing a solution to this particular issue.

Prerequisites:

You will need to get pandas installed if you have not yet. Below is the pip command to install pandas:

pip install pandas

And let’s import the necessary modules and use this sample data for our demonstration, you can download it into your local folder, or just supply this URL link to pandas read_excel method:

import pandas as pd
import numpy as np

df = pd.read_excel("eShop-Delivery-Record.xlsx", sheet_name=0)

So if we do a quick view of the first 5 rows of the data with df.head(5), you would see the below output:

pandas split one row of data into multiple rows

Assume this is data extracted from an eCommerce system where someone is running an online shop for footwear and apparel products, and the shop provides free 7-day returns for the items it is selling. You can see that each row has the order information, when and who performed the delivery service, and, if the customer requested a return, when the item was returned and by which courier service provider. The data is organized from the shop owner's view, and you may find it difficult to analyse from the courier service providers' perspective with the current format. So we shall do some transformation to make the format simpler for analysis.

Split one row of data into multiple rows

Now let’s say we would like to split this one row of data into 2 rows if there is a return happening, so that each row has the order info as well as the courier service info and we can easily do some analysis such as calculating the return rate for each product, courier service cost for each month by courier companies, etc.

The output format we would like to have is more transaction based, so let's format our date columns and rename the delivery related columns, so that they won't confuse us later when splitting the data.

df["Delivery Date"] = pd.to_datetime(df["Delivery Date"]).dt.date
df["Return Date"] = pd.to_datetime(df["Return Date"]).dt.date

df.rename(columns={"Delivery Date" : "Transaction Date",
"Delivery Courier" : "Courier",
"Delivery Charges" : "Charges"}, inplace=True)

And we add one more column as transaction type to indicate whether the record is for delivery or return. For now, we just assign it as “DELIVERY” for all records:

df["Transaction Type"] = "DELIVERY"

The rows we need to split are the ones with return info, so let's create a filter by checking whether the return date is present:

flt_returned = ~df["Return Date"].isna()

If you verify the filter with df[flt_returned], you shall see all rows with return info are selected as per below:

pandas split one row of data into multiple rows

To split out the delivery and return info for these rows, we will need to perform the below steps:

  • Duplicate the current 1 row into 2 rows
  • Change the transaction type to “RETURN” for the second duplicated row
  • Copy values of the Return Date, Return Courier, Return Charges to Transaction Date, Courier, Charges respectively

To duplicate these records, we use the data frame index.repeat() to repeat each of these indexes twice, and then use the loc function to get the data for these repeated indexes. Below is the code to create the duplicate records for the rows with return info:

d = df[flt_returned].loc[df[flt_returned].index.repeat(2),:].reset_index(drop=True)

Next, let’s save the duplicated row indexes into a variable, so that we can refer to it multiple times even when some data in the duplicated row changed. We use the data frame duplicated function to return the index of the duplicated rows. For this function, the keep=”first” argument will mark 1st row as non-duplicate and the subsequent rows as duplicate, while keep=”last” will mark the 1st row as duplicate.

idx_duplicate = d.duplicated(keep="first")
#the default value for keep argument is "first", so you can just use d.duplicated()

With this idx_duplicate variable, we can directly update the transaction type for these rows to RETURN:

d.loc[idx_duplicate,"Transaction Type"] = "RETURN"

And next, we shall copy the return info into the Transaction Date, Courier and Charges fields for these return records. You can either select rows based on the transaction type value, or continue to use idx_duplicate to identify the return records.

Below will copy values from Return Date, Return Courier, Return Charges to Transaction Date, Courier, Charges respectively:

d.loc[idx_duplicate, ["Transaction Date", "Courier", "Charges"]] = d.loc[idx_duplicate, 
                                                     ["Return Date", "Return Courier","Return Charges"]].to_numpy()

If you check the data now, you shall see for the return rows, the return info is already copied over:

pandas split one row of data into multiple rows

(Note: you may want to check here to understand why to_numpy() is needed for swapping columns)
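In short, assigning one DataFrame slice to another aligns on the column labels, so the values would simply land back in their original columns; converting to a plain NumPy array bypasses that alignment. A tiny sketch with hypothetical columns A and B:

#d.loc[idx, ["A", "B"]] = d.loc[idx, ["B", "A"]]             # aligned by label, nothing moves
#d.loc[idx, ["A", "B"]] = d.loc[idx, ["B", "A"]].to_numpy()  # positional assignment, values actually swap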

Finally, we need to combine the original rows, which only have delivery info, with the above processed data. Let's also sort the values by order number and reset the index:

new_df = pd.concat([d, df[~flt_returned]]).sort_values("Order#").reset_index(drop=True)

Since the return related columns are redundant now, we shall drop them to avoid confusion. Let's use the below code to drop them by matching the "Return" keyword in the column labels:

new_df.drop(new_df.filter(regex="Return*", axis=1), axis=1, inplace=True)

(To understand how df.filter works, check my this article)

Once we have deleted the redundant columns, you shall see the final result in new_df as per below:

pandas split one row of data into multiple rows

So we have successfully transformed our data from the shop owner's view to the courier companies' view, and each delivery and return record is now an individual row.

Conclusion

Data wrangling can sometimes be tough depending on what kind of source data you get. In this article, we have gone through a solution to split one row of data into multiple rows by using the pandas index.repeat to duplicate the rows and the loc function to swap the values. There are other possible ways to handle this, so please do share your comments in case you have a better idea.