Resources

20 Useful Tips for Using Python Pip

20 Tips for Using Python Pip

Introduction

Python has become one of the most popular programming languages due to the easy to use syntax as well as the thousands of open-source libraries developed by the Python community. Almost every problem you want to solve, you can find a solution with these third-party libraries, so that you do not need to reinvent the wheels. Majority of these libraries are hosted in the repository called Pypi and you can install these libraries with the Python pip command.

Python pip module helps you to manage the downloading, installation of the packages, and solving the dependency requirements. Although you probably have used pip for some time, you may not spend much time to read through it’s user guide for some of the useful operations. In this article, we have summarize the 20 useful tips for managing Python third party packages with Python pip.

Check the current pip version

Since Python version 3.4, the pip module has been included by default within the Python binary installer, so you do not need to install it separately once you have Python program installed. To check the version of the pip package, you can use the below:

pip --version

Sample output:

Python Pip version

Install package from Pypi

Installing package is very simple with pip command, you can use “install” option followed by one or multiple package names:

pip install requests

By default, pip looks for the latest release and install the latest version for you together with the dependency packages. Sample output as per below:

Python Pip install package

You can also specify the version number of the package to be installed:

py -m pip install pip==21.1.1

Sample output:

Python Pip install package with version number

Pip also supports a list of version specifier such as >=1.2, <2.0, ~=2.0, !=2.0 or ==1.9.* for matching the correct version of the package to be installed.

When you are not in a virtual environment, the package will be installed into the global folder (system-site) by default, you can use the “–user” option to specify the installation folder in user-site in case of any permission issue. E.g.:

pip install --user requests

Output as per below:

Python Pip install package to user-site

Although you can specify your own customized installation path for your different projects, using virtual environment is still the best way to manage dependencies and conflicts.

Show package version and installation location

To check the basic information such as version number or installation location for an existing package, you can use the “show” option:

pip show colorama

You can see the below information about the package:

Python Pip show package version and installation location

And you can also use the “–verbose” mode to display the additional meta info.

List all the packages installed

To list out all the packages installed, you can use the “list” option:

py -m pip list

You shall see the output format similar to below:

Python Pip list packages

You can add a “–user” option to list all packages installed in your user-site, e.g.:

py -m pip list --user

When you are using virtual environment with “–system-site-packages” (allowing virtual environment to access system-site packages), you can use the “list –local” option to show only the packages installed in your virtual environment:

py -m pip list --local

List all the outdated packages

To check if any installed packages are outdated, you can use the “–outdated” option:

py -m pip list -o
# or
py -m pip list --outdated

Below is the sample output:

Python Pip list outdated packages

Upgrade package to the latest version

Once identified the outdated packages, you can manually use the “–upgrade” option to upgrade the package to the latest version. Multiple package name can be specified with whitespaces:

py -m pip install --upgrade pip
#or 
py -m pip install --U pip setuptools

Sample output as per below:

Python Pip upgrade package

Auto upgrade packages to the latest version

Pip does not have an option to auto upgrade the outdated packages, but you can make use of the result from “list -o” and create a simple script to achieve it, e.g.:

#in Windows command line
for /F "skip=2 delims= " %i in ('pip list --o --local') do pip install -U %i

#in linux
pip list --o --local | grep -v '^\-e' | cut -d = -f 1  | xargs -n1 pip install -U

Export installed packages

You can use “freeze” option to export all your installed package names into a text file, so that you can re-create exactly the same project environment in another PC. For instance:

py -m pip freeze -l > requirements_demo.txt

Result in the output text file:

Python Pip install requirement file

Install multiple packages from requirement file

For the packages you’ve exported with “freeze” option, you can re-install all the packages in another environment with the below “-r” option:

py -m pip install -r requirements.txt

You may see the below output when you have package name “numpy” in your requirements.txt file:

Python Pip install package with requirement file

The requirements.txt also allows to include other requirement files. This may be useful when you have a sub module requires extra packages and can be run independently as a separate application. So you may put the common packages in the requirements.txt and the additional packages in the requirements_module1.txt file,  the include the requirements.txt file in your module file.

E.g. the content in the requirements_module1.txt:

#opencv-python
#comment out some packages
python-dateutil

-r requirements.txt

When you run the “install” command:

py -m pip install -r requirements_module1.txt

You shall the sample output as per below:

Python Pip install package with multiple requirement files

Uninstall packages

Uninstalling an existing package can be done with below command:

pip uninstall numpy

Output as per below:

Python Pip uninstall package

Install package from wheel file

When you have a binary wheel file downloaded in your local folder, you can also use the “install” option to install the wheel file directly:

py -m pip install --force-reinstall C:\Users\codef\Downloads\python_dateutil-2.8.2-py2.py3-none-any.whl

Output as per below:

pip install whl

Install package from non-Pypi index

If the package is not hosted in Pypi index, you can manually specify the index url with “–index-url” or simply “-i” :

py -m pip install -i https://mirrors.huaweicloud.com/repository/pypi/simple/ colorama

Above command would download and install the package from huawei cloud repository (a PyPi mirror):

Python Pip install package from Pypi mirrors

This would be also helpful when you are not able to access the Pypi directly due to the firewall or proxy issue in your network, you can find a Pypi mirror repository and download the packages from there. Usually these mirrors synchronize with Pypi in a few minutes interval which should not cause any issue for your development work.

Configure global index url

To permanently save the index url so that you do not have to key in the url for every package installation, you can use the “config” option to set the url globally. e.g:

pip config set global.index-url https://mirrors.aliyun.com/pypi/simple

With the above setting, you can install package from the mirror repository as per normal without specifying the url option.

Check package compatibility

When you manually install the packages, sometimes you may encounter issues that some dependency packages

having incompatible version installed. To check if you have any such issue, you can use the “check” option :

python -m pip check

You may see something similar to below when there is any conflict:

Python Pip check package compatibility

Download package into local folder

You can download the package wheel files into your local folder when you need:

pip download requests -d .\requests

The “-d” option allows you specify the target folder where you want to save the wheel files. You may get multiple wheel files if the package has any dependency packages. (you can use “–no-deps” when you do not want to download the dependency files)

Below is the sample result:

 

Python Pip download wheels file

Install package into local folder

To install the package from a folder, you can use the “-f” with the file path:

pip install requests -f .\requests

This is the same as installing the package from Pypi:

Python Pip install package offline

Conclusion

In this article we have summarized some useful tips for using Python pip to manage the installation and upgrading of the third party packages for your Python projects. For more advanced usage of this module, you may refer to it’s official document.

reading email from outlook with python pywin32

5 Useful Tips for Reading Email From Outlook In Python

Introduction

Pywin32 is one of the most popular packages for automating your daily work for Microsoft outlook/excel etc. In my previous post, we discussed about how to use this package to read emails and save attachments from outlook. As there were quite many questions raised in the comments which were not covered in the original post, this article is intended to review through some of the advanced topic for reading emails from outlook via Python Pywin32 package.

If you have not yet read through the previous post, you may check it out from here.

Prerequisites:

Assuming you have already installed the latest Pywin32 package and imported below necessary packages in your script, and you shall not encounter any error after executing the GetNamespace method to establish the outlook connection:

import win32com.client

#other libraries to be used in this script 
import os 
from datetime import datetime, timedelta

outlook = win32com.client.Dispatch('outlook.application') 
mapi = outlook.GetNamespace('MAPI')

When using below code to iterate the Accounts property, you shall see whichever accounts you have configured in your outlook:

for account in mapi.Accounts: 
    print(account.DeliveryStore.DisplayName)

#Assuming below accounts have been configured:
#[email protected]
#[email protected]

Now let’s move on to the topics we are going to discuss in this article.

Reading Email from Multiple Outlook Accounts

If you have multiple accounts configured in your outlook application, to access one of the accounts, you can use the Folders method and specify the account name or index of the account, e.g.:

for idx, folder in enumerate(mapi.Folders):
    #index starts from 1
    print(idx+1, folder)

#Assuming below output:
# 1  [email protected]
# 2  [email protected]

And to access the sub folders under a particular email account, you can continue to use the folders method to specify the sub folder name or index of the folder. Before that, you may want to check what are the available sub folders and it’s index value as per below:

for idx, folder in enumerate(mapi.Folders("[email protected]").Folders):
    print(idx+1, folder)
# or using index to access the folder
for idx, folder in enumerate(mapi.Folders(1).Folders): 
    print(idx+1, folder)

You shall see something similar to the below:

reading email from outlook with Python pywin32

With the above folder index and name, you shall be able to access the email messages as per below:

messages = mapi.Folders("[email protected]").Folders("Inbox").Items
# or
messages = mapi.Folders(1).Folders(2).Items
for msg in list(messages):
    print(msg.Subject)

Although the index would not get changed when you move up/down of your folders in outlook, obviously using folder name still is much better than index in terms of readability of the code.

Filter Email Based on Receiving Time Window

When reading emails from outlook inbox, you may want to zoom into the emails within a specific receiving time window rather than scanning through thousands of emails you have received in the inbox. To filter emails based on certain conditions, you can use restrict method together with the logical operators.

For instance, to filter the emails received from 1st day of the current month until today 12am:

today = datetime.today()

# first day of the month
start_time = today.replace(month=1, hour=0, minute=0, second=0).strftime('%Y-%m-%d %H:%M %p')

#today 12am
end_time = today.replace(hour=0, minute=0, second=0).strftime('%Y-%m-%d %H:%M %p')

messages = messages.Restrict("[ReceivedTime] >= '" + start_time
+ "' And [ReceivedTime] <= '" + end_time + "'")

With logical operators like AND, OR and NOT, you are able to combine multiple criteria together. For instance, to check the email with certain subject but not from a particular sender email:

messages = messages.Restrict("[Subject] = 'Sample Report'" 
                             + " And Not ([SenderEmailAddress] = '[email protected]')")

And you can also use the Restrict method as many times as you wish if it makes your code more readable than combining all conditions in one filter, e.g.:

messages = messages.Restrict("[Subject] = 'Sample Report'")
messages = messages.Restrict("Not ([SenderEmailAddress] = '[email protected]')")

Getting First N emails

When using Restrict method for filtering email messages, you would not be able to specify max number of emails you want to read. If you wish to get the first/last N emails based on the receiving time, you can use the Sort method to sort the messages based on certain email properties before you slice the list. Below is the sample code to get the latest 10 email messages based on the receiving time:

messages.Sort("[ReceivedTime]", Descending=True)

#read only the first 10 messages
for message in list(messages)[:10]:
    print(message.Subject, message.ReceivedTime, message.SenderEmailAddress)

Wildcard Matching for Filtering

With the Restrict method, you cannot do wildcard matching such as searching whether the email subject or body contains certain keywords. To be able to achieve that, you will need to use the DASL query.

For instance, with the below DASL query syntax, you can filter email subject which contains “Sample Report” keyword:

messages = messages.Restrict("@SQL=(urn:schemas:httpmail:subject LIKE '%Sample Report%')")

You may want to check here to see what are the fields supported in ADSL query and the correct namespace to be used.

Include/Exclude Multiple Email Domains

To filter the emails only from a particular domain, you can use the ADSL query similar to the previous example:

messages = messages.Restrict("@SQL=(urn:schemas:httpmail:SenderEmailAddress LIKE '%company.com')")

And to exclude the emails from a few domains, you can use multiple conditions with logical operators:

messages = messages.Restrict("@SQL=(Not(urn:schemas:httpmail:senderemail LIKE '%@abc%') \
And Not(urn:schemas:httpmail:senderemail LIKE '%@123%') \
And Not(urn:schemas:httpmail:senderemail LIKE '%@xyz%'))")

Conclusion

In this article, we have reviewed through some advanced usage of the Pywin32 package for filtering emails. You may not find many Python tutorials for this package from online directly, but you shall be able to see the equivalent VBA code from its official website for most of the code you have seen in this article. In the event that you cannot find a solution for you problem, you may check and see whether there is something implemented in VBA code that you can convert it into Python syntax.

Link for the previous post Reading Email From Outlook In Python.

common python mistakes for beginners

8 Common Python Mistakes You Shall Avoid

Introduction

Python is a very powerful programming language with easily understandable syntax which allows you to learn by yourself even you are not coming from a computer science background. Through out the learning journey, you may still make lots mistakes due to the lack of understanding on certain concepts. Learning how to fix these mistakes will further enhance your understanding on the fundamentals as well as the programming skills.

In this article, I will be summarizing a few common Python mistakes that many people may have encountered when they started the learning journey and how they can be fixed or avoided.

Reload Modules after Modification

Have you ever wasted hours to debug and fix an issue and eventually realized you were not debugging on your modified source code? This usually happens to the beginners as they did not realize the entire module was only loaded into memory once when import statement was executed. So if you are modifying some code in separate module and import to your current code, you will have to reload the module to reflect the latest changes.

To reload a module, you can use the reload function from the importlib module:

from importlib import reload

# some module which you have made changes
import externallib

reload(externallib)

Naming Conflict for Global and Local Variables

Imagine you have defined a global variable named app_config, and you would like to use it inside the init_config function as per below:

app_config = "app.ini"

def init_config():
    app_config = app_config or "default.ini"
    print(app_config)

You may expect to print out “app.ini” since it’s already defined globally, but surprisedly you would get the “UnboundLocalError” exception due to the variable app_config is referenced before assignment. If you comment out the assignment statement and just print out the variable, you would see the value printed out correctly. So what is going on here?

The above exception is due to Python tries to create a variable in local scope whenever there is an assignment expression, and since the local variable and global variable have the same name, the global variable being shadowed in local scope. Thus Python throws an error saying your local variable app_config is used before it’s initialized.

To solve this naming conflict, you shall use different name for your global variable and local variables to avoid any confusion, e.g.:

app_config = "app.ini"

def init_config():
    config = app_config or "default.ini"
    print(config)

Checking Falsy Values

Examining true or false of a variable in if or while statement sometimes can also go wrong. It’s common for Python beginners to mix None value and other falsy values and eventually write some buggy code. E.g.:  assuming you want to check when price is not None and below 5, trigger some selling alert:

def selling_alert(price):
    if price and price < 5:
        print("selling signal!!!")

Everything looks fine, but when you test with price = 0, you would not get any alert:

selling_alert(0)
# Nothing has been printed out

This is due to both None and 0 are evaluated as False by Python, so the printing statement would be skipped although price < 5 is true.

In python, empty sequence objects such as “” (empty string), list, set, dict, tuple etc are all evaluated as False, and also zero in any numeric format like 0 and 0.0. So to avoid such issue, you shall be very clear whether your logic need to differentiate the None and other False values and then split the logic if necessary, e.g.:

if price is None:
   print("invalid input")
elif price < 5:
   print("selling signal!!!")

Default Value and Variable Binding

Default value can be used when you want to make your function parameter optional but still flexible to change. Imagine you need to implement a logging function with an event_time parameter, which you would like to give a default value as current timestamp when it is not given. You can happily write some code as per below:

from datetime import datetime

def log_event_time(event, event_time=datetime.now()):
    print(f"log this event - {event} at {event_time}")

And you would expect as long as the event_time is not provided during log_event_time function call, it shall log an event with the timestamp when the function is invoked. But if you test it with below:

log_event_time("check-in")

# log this event - check-in at 2021-02-21 14:00:56.046938

log_event_time("check-out")

# log this event - check-out at 2021-02-21 14:00:56.046938

You shall see that all the events were logged with same timestamp. So why the default value for event_time did not work?

To answer this question, you shall know the variable binding happens during the function definition time. For the above example, the default value of the event_time was assigned when the function is initially defined. And the same value will be used each time when the function is called.

To fix the issue, you can assign a None as default value and check to overwrite the event_time inside your function call when it is None. For instance:

def log_event_time(event, event_time=None):
    event_time = event_time or datetime.now()
    print(f"log this event - {event} at {event_time}")

Similar variable binding mistakes can happens when you implement your lambda functions. For your further reading, you may check my previous post why your lambda function does not work for more examples.

Default Value for Mutable Objects

Another mistake Python beginners trend to make is to set a default value for a mutable function parameter. For instance, the below user_list parameter in the add_white_list function:

def add_white_list(user, user_list=[]):
    user_list.append(user)
    return user_list

You may expect when user_list is not given, a empty list will be created and then new user will be added into this list and return back. It is working as expected for below:

my_list = add_white_list('Jack')

# ['Jack']

my_list = add_white_list('Jill', my_list)

#['Jack', 'Jill']

But when you want to start with a empty list again, you would see some unexpected result:

my_new_list = add_white_list('Joe')
# ['Jack', 'Jill', 'Joe']

From the previous variable binding example, we know that the default value for user_list is created only once at the function definition time. And since list is mutable, the changes made to the list object will be referred by the subsequent function calls.

To solve this problem, we shall give None as the default value for user_list and use a local variable to create a new list when user_list is not given during the call. e.g.:

def add_white_list(user, user_list=None):
    if user_list is None:
        user_list = []
    user_list.append(user)
    return user_list

Someone may get confused that datetime.now() shall create a Python class instance, which supposed to be mutable also. If you checked Python documentation, you would see the implementation of datetime is actually immutable.

Misunderstanding of Python Built-in Functions

Python has a lot of powerful built-in functions and some of them look similar by names, and if you do not spend some time to read through the documentation, you may end up using them in the wrong way.

For instance, you know built-in sorted function or list sort function both can be used to sort sequence object. But occasionally, you may make below mistake:

random_ints = [80, 53, 7, 92, 30, 31, 42, 10, 42, 18]

# The sorting is done in-place, and function returns None
even_integers_first = random_ints.sort(key=lambda x: x%2)

# Sorting is not done in-place, function returns a new list
sorted(random_ints)

Similarly for reverse and reversed function:

# The reversing is done in-place, and function returns None
random_ints = random_ints.reverse()
# reversing is not done in-place, function returns a new generator
reversed(random_ints)

And for list append and extend function:

crypto = ["BTC", "ETH"]

# the new list will be added as 1 element to crypto list
crypto.append(["XRP", "BNB"])

print(crypto)
#['BTC', 'ETH', ['XRP', 'BNB']]

# the new list will be flattened when adding to crypto list
crypto.extend(["UNI"])

print(crypto)
# ['BTC', 'ETH', ['XRP', 'BNB'], 'UNI']

Modifying Elements While Iterating

When iterating a sequence object, you may want to filter out some elements based on certain conditions.

For instance, if you want to iterate below list of integers and remove any elements if it is below 5. You probably would write the below code:

a = [1, 2, 3, 4, 5, 6, 2]

for b in a:
    if b < 5:
        a.remove(b)

But when checking the output of the list a, you would see the result is not as per you expected:

print(a)
# [4, 5, 6, 2]

This is because the for statement will evaluate the expression and create a generator for iterating the elements. Since we are deleting elements from the original list, it will also change the state of the generator, and then further cause the unexpected result. To fix this issue, you can make use of the list comprehension as per below if your filter condition is not complex:

[b for b in a if b >= 5]

Or if you wish, you can use the the filterfalse together with the lambda function:

from itertools import filterfalse
list(filterfalse(lambda x: x < 5, a))

Re-iterate An Exhausted Generator

Many Python learners started writing code without understanding the difference between generator and iterator. This would cause the error that re-iterating an exhausted generator. For instance the below generator, you can print out the values in a list:

some_gen = (i for i in range(10))
print(list(some_gen))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

And sometimes you may forget you have already iterated the generator once, and when you try to execute the below:

for x in some_gen:
    print(x)

You would not be able to see anything printed out.

To fix this issue, you shall save your result into a list first if you’re not dealing with a lot of data, or you can use the tee function from itertools module to create multiple copies of the generator so that you can iterate multiple times:

from itertools import tee

# create 3 copies of generators from the original iterable
x, y, z = tee(some_gen, n=3)

Conclusion

In this article, we have reviewed through some common Python mistakes that you may encounter when you start writing Python codes. There are definitely more mistakes you probably would make if you simply jump into the coding without understanding of the fundamentals. But as the old saying, no pain no gain, ultimately you will get there when you drill down all the mistakes and clear all the roadblocks.

Python one-liners with list comprehension and ternary operation

15 Most Powerful Python One-liners You Can’t Skip

Introduction

One-liner in Python refers to a short code snippet that achieves some powerful operations. It’s popular and widely used in Python community as it makes the code more concise and easier to understand. In this article, I will be sharing some most commonly used Python one-liners that would definitely speed up your coding without compromising any clarity.

Let’s start from the basis.

Ternary operations

Ternary operation allows you to evaluate a value based on the condition being true or false. Instead of writing a few lines of if/else statements, you can simply do it with one line of code:

x = 1
y = 2

result = 1 if x > 0 and y > x else -1
print(result)
# 1

#re-assign x to 6 if it is evaluated as False
x = x or 6

Assign values for multiple variables

You can assign values for multiple variables simultaneously as per below. (You may want to check this article to understand what is going on under the hood)

key, value = "user", "password"

print(key, value)
#('user', 'password')

Swap variables

To swap the values of the variables, simply perform the below without having a temp variable which is usually required by other programming languages like Java or C.

key, value = value, key

print(key, value) 
#('password', 'user')

Swap elements in a list

Imagine you have a list of users as per below, and you would like to swap the first element with last element:

users = ["admin", "anonymous1", "anonymous2"]

Since list is mutable, you can re-assign the values of first and last elements by swapping their sequence as per below:

users[0], users[2] = users[2], users[0]
# or users[0], users[-1] = users[-1], users[0]

print(users)
#['anonymous2', 'anonymous1', 'admin']

Further more, if the requirement is to swap the elements at the odd and even positions in a list, e.g. in the below list:

numbers = list(range(10))
#[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

We would like to swap the elements of the 2nd and 1st, 4th and 3rd, and so on. It can be achieved by performing the below list slicing with assignment operation:

numbers[::2], numbers[1::2] = numbers[1::2], numbers[0::2]

print(numbers)
#[1, 0, 3, 2, 5, 4, 7, 6, 9, 8]

Replace elements in a list

To further expend the above example, if we want to replace the elements on every odd/even position in a list, for instance to 0, we can do the re-assignment with below:

numbers[1::2] = [0]*len(numbers[1::2])

print(numbers)
#[0, 0, 2, 0, 4, 0, 6, 0, 8, 0]

Of course there is an alternative way with list comprehension, we shall touch on it later.

Generate list with list comprehension

By using list comprehension, you can easily generate new a list with certain filtering conditions from the current sequence object. For instance, the below will generate a list of even numbers between 1 to 20:

even_nums = [i for i in range(1, 20) if i%2 == 0]

print(even_nums)
#[2, 4, 6, 8, 10, 12, 14, 16, 18]

Create sub list from a list

Similarly, you can get a sub list from the existing list with the list comprehension as per below:

[i for i in even_nums if i <5]
# 2, 4

Manipulating elements in the list

With list comprehension, you can also transform your list of elements into another format. For instance, to convert the integers to alphabets:

alphabets = [chr(65+i) for i in even_nums]
# ['C', 'E', 'G', 'I', 'K', 'M', 'O', 'Q', 'S']

Or convert the upper case into lower case:

[i.lower() for i in alphabets]
#['c', 'e', 'g', 'i', 'k', 'm', 'o', 'q', 's']

And all the above can be done without list comprehension as well:

list(map(lambda x : chr(65+x), even_nums))
#['C', 'E', 'G', 'I', 'K', 'M', 'O', 'Q', 'S']

list(map(str.lower, alphabets))
#['c', 'e', 'g', 'i', 'k', 'm', 'o', 'q', 's']

Another real world example would be to use list comprehension to list out all the .ipynb files from current folder and its sub folders (excluding the checkpoint files):

import os

[f for d in os.walk(".") if not ".ipynb_checkpoints" in d[0]
             for f in d[2] if f.endswith(".ipynb")]

Flatten a list of sequences

If you have a list of sequence objects as per below, and you would like to flatten them into 1 dimensional:

a = [[1,2], [3,4], [5,6,7]]

You can use multiple for expressions in list comprehension to flatten it:

b = [y for x in a for y in x]

print(b)
#[1, 2, 3, 4, 5, 6, 7]

Alternatively, you can make use of the itertools module to get the same result:

import itertools

list(itertools.chain.from_iterable(a))

Ternary operation with list comprehension

In the previous ternary operation example, we have discussed how to replace the elements in the even position of a list. Here is the alternative using list comprehension in conjunction with a ternary expression:

numbers = list(range(10))
[y if i % 2 == 0 else 0 for i, y in enumerate(numbers)]
#[0, 0, 2, 0, 4, 0, 6, 0, 8, 0]

Generate a dictionary with dictionary comprehension

To derive a dictionary from a list, you can use dictionary comprehension as per below:

even_nums_dict = {chr(65+i):v for i, v in enumerate(even_nums)}

#{'A': 2, 'B': 4, 'C': 6, 'D': 8, 'E': 10, 'F': 12, 'G': 14, 'H': 16, 'I': 18}

Generate a set with set comprehension

Similar operation is also available for set data type when you want to derive elements from a list to set:

even_nums_set = {chr(65+i) for i in even_nums}
#{'C', 'E', 'G', 'I', 'K', 'M', 'O', 'Q', 'S'}

When using the built-in data type set, you shall expect that it only keeps the unique values. For instance, you can use set to remove duplicate values:

a = [1,2,2,4,6,7]
unique = set(a)

print(unique)
#{1, 2, 4, 6, 7}

More Python comprehension examples can be found here.

Read file into generator

Reading files can be done in one-liner as per below:

text = (line.strip() for line in open('response.html', 'r'))

Take note the parentheses are used in above generator expression rather than [], when [] is used, it returns a list.

One-liner with Python -c command

Sometimes you may want to run code snippets without entering into the Python interactive mode. You can execute the code with Python -c option in command line window. For instance, check the current Python version:

python -c "import sys; print(sys.version.split()[0])"

#3.7.2

Or check the value of the environment variable:

python -c "import os;print(os.getenv('PATH').split(';'))"

Conclusion

In this article, we have reviewed through some commonly used Python one-liners which would greatly improve your code readability and coding productivity. There are definitely more to be covered, and every Pythonista would have his/her own list of favorite one-liners. Sometimes you will also need to consider the code performance before you use it rather simply pursuing the conciseness of the code.

As the general rule of thumb, you shall not use/innovate something that confusing ,difficult to read or totally not benefiting either in readability or productivity.

web scraping with python requests and lxml

Web Scraping From Scratch With 3 Simple Steps

Introduction

Web scraping or crawling refers to the technique to extract the information from a website and transform into structured data for later analysis. There are generally a few reasons that you may need to implement a web scraping scripts to automate the data collection process:

  • There isn’t any public API available for you to get data from the source sites
  • The information is updated from time to time, such as the exchange rate, you cannot manage it in a manual way
  • The final data you need is piecemeal from multiple sites; and so on

Before you decide to implement a scraping script, you will also need to check to be sure that you are not violating the term of use for the data you are going to scrape. Some sites are against the scraping robot. This article is intended for education purpose to help you to understand the overall processes of web scraping, so we will assume you already know the implication of the web scraping and possible legal issues on how the data is used.

Scraping a website sometimes can be difficult depends on how the target website is designed and where the data is resided. But generally you can split the process into 3 steps. Let’s walk through them one by one.

Understand the structure of your target website

As the first step, you shall take a quick look at your target website to see how the front end interacts with the backend, and how the data is populated to the web page. To keep our example simple, let’s assume user authentication is not required and our target is to extract the price change for the top 20 cryptocurrencies from coindesk for further analysis.

The first thing we shall do is to understand how this information is organized on the website. Below is the screenshot of the data presented on the web page:

web scraping with python requests and lxml

In Chrome browser, if you right click on the web page to inspect the HTML elements, you shall see that the entire data table is under <section class=”cex-table”>…</section>. You can verify this by hovering your mouse to this element, you would see there is a light blue overlay on the data table as per below:web scraping in python with requests and lxml

Next, you may want to inspect each text field on the page to further understand how the table header and records are arranged. For instance, when you check the “Asset” text field, you would see the below HTML structure:

<section class="cex-table">
	<section class="thead">
		<div>...</div>
		<div class="tr-wrapper">
			<div class="tr-left">
				<div class="tr">
					<div>...</div>
					<div style="flex:7" class="th">
						<span class="cell">
						<i class="sorting-icon">
						</i>
						<span class="cell-text">Asset</span>
						</span>
					</div>
				</div>
			</div>
		</div>
		...
	</section>
</section>

And similarly you can find the structure of the first row in the table body as per below:

<section class="tbody">
	<section class="tr-section">
		<a href="/price/bitcoin">
			<div class="tr-wrapper">
				<div class="tr-left">
					<div class="tr">
						<div style="flex:2" class="td">
							<span class="cell cell-rank">
							<strong>01</strong>
							</span>
						</div>
						<div style="flex:7" class="td">
							<span class="cell cell-asset">
							<img>...</img>
							<strong class="cell-asset-title">Bitcoin</strong>
							<span class="cell-asset-iso">BTC</span>
							</span>
						</div>
					</div>
				</div>
			</div>
		</a>
	</section>
</section>

You may notice that majority of these HTML elements does not have a id or name attribute as the unique identifier, but the style sheet (“class” attribute) is quite consistent for the same row of data. So in this case, we shall consider to use the style sheet as a reference to find our data elements.

Locate and parse the target data element with XPath

With the initial understanding on HTML structure of our target website, we shall start to find a way to locate the data elements programmatically.

For this demonstration, we will use requests and lxml libraries to send the http requests and parse the results. There are other package for parsing DOM such as beautifulsoup, but personally I find using XPath expression is more straightforward when locating an element although the syntax may not as intuitive as the way beautifulsoup does.

Below is the pip command if you do not have these two packages installed:

pip install requests
pip install lxml

Let’s import the packages and send a GET request to our target URL:

import requests
from lxml import html

target_url = "https://www.coindesk.com/coindesk20"
result = requests.get(target_url)

Our target URL does not require any parameters, in case you need to pass in parameters, you can pass via the params argument as per below:

payload = {"q" : "bitcoin", "s" : "relevant"}
result = requests.get("https://www.coindesk.com/search", params=payload)

The result is a response object which has a status_code attribute to indicate if correct response has been returned from the target website. To simplify the code, let’s assume we can always get the correct response with the return HTML in string format from the text attribute.

We then pass our HTML string to lxml and use it to parse the DOM tree as per below:

tree = html.fromstring(result.text)

Now we come to the most important step, we will need to use XPath syntax to locate the data elements we want and extract the data out.

Since the id or name attributes are not available for these elements, we will need to use the style sheet to locate our data elements. To locate the table header, we need to perform the below:

  • Find the section tag with style sheet class as “cex-table” from the entire DOM
  • Find its child section node with style sheet class as “thead
  • Further find its child div node with style sheet as “tr-wrapper

Below is how the syntax looks like in XPath:

table_header = tree.xpath("//section[@class='cex-table']/section[@class='thead']/div[@class='tr-wrapper']")

It will scan through the entire DOM tree to find if any element matches this structure and return a list of nodes matched.

If everything goes well, the table_header list should only contain 1 element which is the div with “tr-wrapper” style sheet. Sometimes if it returns multiple nodes, you may need recheck your path expression to see how you can fine-tune it to get only the unique node that you need.

From the wrapper div, there are still a few levels before we can reach to the node with the text. But you may notice that all the data fields we need are under the span tag which has a style name “cell-text“. So we can actually locate all these span tags with CSS class and extract its text with text() function. Below is how it works in XPath expression:

headers = table_header[0].xpath(".//span[@class='cell']/span[@class='cell-text']/text()")

Note that “.” means to start from current node, and “//” is to indicate the following path expression is relative path

If you examine the headers now, you can see all the column headers are extracted into a list as per below:

['Asset',
 'Price',
 'Market Cap',
 'Total Exchange Volume',
 'Returns (24h)',
 'Total Supply',
 'Category',
 'Value Proposition',
 'Consensus Mechanism']

Let’s continue to move the table body. Following the same logic, we shall be able to locate to the section with “tr-section” in below syntax:

table_body = tree.xpath("//section[@class='cex-table']/section[@class='tbody']/section[@class='tr-section']")

This means that we have already collected all the nodes for rows in the table body. We can now loop through the rows to get the elements. We will use the style sheet to locate our elements, but for the “Asset” column, it actually contains a few child nodes with different style sheet, so we need to handle them separately from the rest of the columns. Below is the code to extract the data row by row and add it into a record list:

records = []
for row in table_body:    
    tokens = row.xpath(".//span[contains(@class, 'cell-asset-iso')]/text()")
    ranks = row.xpath(".//span[contains(@class, 'cell-rank')]/strong/text()")
    assets = row.xpath(".//span[contains(@class, 'cell-asset')]/strong/text()")
    spans = row.xpath(".//div[contains(@class,'tr-right-wrapper')]/div/span[contains(@class, 'cell')]")
    rest_cols = [span.text_content().strip() for span in spans]
    row_data = ranks + tokens + assets + rest_cols
    records.append(row_data)

Note that we are using “contains” in order to match the node with class like cell cell-rank“, and use text_content() to extract all the text from its current nodes and child nodes.

Occasionally you may find that the number of columns we extracted does not tally with the original column header due to header column merged or hidden, such as our above ranking and token ticker column. So let’s also give them column name as “Rank” and “Token”:

column_header = ["Rank", "Token"] + headers

Save the scraping result

With both the header and data ready, we can easily load the data into pandas as per below:

import pandas as pd
df = pd.DataFrame(records, columns=column_header)

You can see the below result in pandas dataframe, which looks pretty good except some formatting to be done to convert all the amount into proper number format.

web scraping to get cryptocurrency price

Or you can also write the scrapped data into a csv file with the csv module:

import csv
with open("token_price.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(column_header)
    for row in records:
        writer.writerow(row)

Limitations & Constraints

In your real scraping project, you may encounter more complicated scenarios rather than directly getting the data from a GET request. So it’s better to understand how are the constraints/limitations for our above mentioned approach.

  • Go through the authentication process can be time-consuming with requests

If your target website requires authentication before you can retrieve the data, you may need to create a session and send multiple POST/GET requests to the server in order to get yourself authorized. Depends on how complicated the authentication process is, you will need to understand what are the parameters to be supplied and how the requests are chained together. This process may take some time and effort.

  • You cannot trigger JavaScript code to get your data

If the response from your target website returns some JavaScript code to populate the data, or you need to trigger some JavaScript function in order to have the data populated on the web page, you may find requests package simply would not work.

For both scenarios, you may consider to use selenium which I have mentioned in one of my past post. It has a headless mode where you can simulate user’s action such as key in user credentials or click buttons without actually showing the browser, and you can also execute JavaScript code to interact with the web page. The downside is that you will have to periodically upgrade your driver file to match with the browser’s version.

Conclusion

In this article, we have reviewed through a very basic example to scrape data with requests and lxml packages, and we have also discussed a few limitations where you may start looking for alternatives such as selenium or even the scrapy framework in case you have more complicated scenarios to be handled. No matter which libraries you choose to use, the fundamental remains the same. Hope this article gives you some hints on how to start your web scraping journey.