Resources

Best Tips for Python, Data Science and Automation


Pandas Tips - Convert Columns To Rows

  Introduction In one of my previous posts – Pandas tricks to split one row of data into multiple rows, we have discussed a solution to split the summary data from one row into multiple rows in order to standardize the data for further analysis. Similarly, there are many scenarios that we have the aggregated […]


Manipulate Audio File in Python With 6 Powerful Tips

Introduction Dealing with audio files may not be that common to a Python enthusiast, but sometimes you may wonder if you are able to manipulate audio files in Python for your personal interest. For instance, if you really like some music, and you want to edit some parts of it and save into your phone, […]


Read and Generate QR Code With 5 Lines of Python Code

 Introduction QR Code is the most popular 2 dimensional barcodes that widely used for document management, track and trace in supply chain and logistics industry, mobile payment,  and even the “touchless” health declaration and contact tracing during the COVID-19 pandemic. Comparing to 1D barcode, QR code can be very small in size but hold more […]


20 Tips for Using Python Pip

Introduction Python has become one of the most popular programming languages due to the easy to use syntax as well as the thousands of open-source libraries developed by the Python community. Almost every problem you want to solve, you can find a solution with these third-party libraries, so that you do not need to reinvent […]


5 Useful Tips for Reading Email From Outlook In Python

Introduction Pywin32 is one of the most popular packages for automating your daily work for Microsoft outlook/excel etc. In my previous post, we discussed about how to use this package to read emails and save attachments from outlook. As there were quite many questions raised in the comments which were not covered in the original […]


8 Common Python Mistakes You Shall Avoid

Introduction Python is a very powerful programming language with easily understandable syntax which allows you to learn by yourself even you are not coming from a computer science background. Through out the learning journey, you may still make lots mistakes due to the lack of understanding on certain concepts. Learning how to fix these mistakes […]


How to auto switch browser tabs

Imagine you have a big monitor and want to display several web pages on it. Wouldn’t it be nice if the browser could switch between the tabs automatically at a fixed interval? In this article, I will share how to auto switch browser tabs with selenium, an automated testing tool.

The Python selenium library has very detailed documentation, which you may want to check as a starting point. For this article, I will just walk through the complete code for this automation, so that you can use it as a reference in case you are trying to implement something similar.

Let’s get started!

To launch the browser automatically, we first need to download the web driver for it. For instance, if you are using the Chrome browser, you may download the driver file here. Do check your browser version to make sure you download the matching driver version.

As a prerequisite, you will also need to run the command below to install the selenium package in your working environment.

pip install selenium

Launch the browser

Then import all the necessary modules into your script. For this article, we will need to use the below modules:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import SessionNotCreatedException

import time
import os, sys

Let’s assume we want to display the 3 links below in the browser and switch between them automatically:

url_1 = "https://www.google.com/maps/@1.3085909,103.8403575,14z"
url_2 = "https://weather.com/en-SG/weather/today"
url_3 = "https://edition.cnn.com/"

Assuming you have already downloaded the chrome driver file and put it into the current script folder, let’s initiate the web driver to launch the browser:

from selenium.webdriver.chrome.service import Service

options = Options()
options.add_experimental_option('useAutomationExtension', False)

try:
	#Selenium 4 takes the driver path through a Service object (the older executable_path argument has been removed)
	driver = webdriver.Chrome(service=Service(os.path.join(os.getcwd(), "chromedriver.exe")), options=options)
except SessionNotCreatedException as e:
	print(e)
	print("please upgrade the chromedriver.exe from https://chromedriver.chromium.org/downloads")
	sys.exit(1)

You may wonder why we need the options parameter here. It’s actually optional, but without setting useAutomationExtension to False you may see the “Loading of unpacked extensions is disabled by the administrator” warning. There are plenty of other options to control the browser behavior, check here for the documentation.

Chrome is updated frequently, and a new browser version may no longer work with an old driver file. So it’s better to catch this exception and show an error message that guides users to upgrade the driver.

You can set the chrome window position as below, although it does not matter if you maximize the window later.

driver.set_window_position(2000, 1)

Let’s open the first link and maximize our window (this can also be done with options.add_argument("start-maximized")). We also execute some JavaScript to zoom in a bit so that we can see the page clearly.

#open window 1
driver.get(url_1)
driver.maximize_window()
driver.execute_script("document.body.style.zoom='120%'")
time.sleep(1)

To open the second tab, we need to use JavaScript to open a blank tab, and then switch the active tab to it. The driver.window_handles attribute keeps a list of handles for the opened windows, so window_handles[1] refers to the second tab.

driver.execute_script("window.open('');")
driver.switch_to.window(driver.window_handles[1])

Next, we will open the second link. For this tab, let’s scroll down 300px to skip the ads section at the page header.

#open second link
driver.get(url_2)
driver.execute_script("document.body.style.zoom='90%'")
driver.execute_script("window.scrollBy(0,300);")
time.sleep(1)

Similarly, we can open the third tab with the below code:

#open window 3
driver.execute_script("window.open('');")
driver.switch_to.window(driver.window_handles[2])
driver.get(url_3)		
driver.execute_script("document.body.style.zoom='90%'")
driver.execute_script("window.scrollBy(0,200);")
time.sleep(1)

Auto switch between tabs

Once everything is ready, we can write the logic to auto switch between the different tabs at a certain interval. To do that, we need to know how to do the 3 things below:

  • Identify which link is active now

We can check whether the driver.title attribute contains a keyword specific to each website, so that we know which page is active now

  • Switch to a new tab

We can continue to use driver.switch_to.window to switch the tab, but we need some logic to determine which tab to switch to next

  • Refresh the page (in case there are any updates)

We can use driver.refresh() to refresh the page, but this loses settings such as the zoom level, so we need to set them again

So let’s take a look at the complete code:

nextIndex = 2

start = time.time()

while True:
	
	#stop running after 5 minutes
	if (time.time() - start >= 5*60):
		break
		
	if "Google Maps" in driver.title:
		driver.refresh()
		driver.execute_script("document.body.style.zoom='120%'")
		time.sleep(3)
		nextIndex = 0 if nextIndex + 1 > 2 else nextIndex + 1
		
	elif "CNN" in driver.title:
		driver.refresh()
		driver.execute_script("document.body.style.zoom='90%'")
		time.sleep(5)
		nextIndex = 0 if nextIndex + 1 > 2 else nextIndex + 1
		
	elif "Weather" in driver.title:
		driver.refresh()
		driver.execute_script("document.body.style.zoom='90%'")
		time.sleep(2)
		nextIndex = 0 if nextIndex + 1 > 2 else nextIndex + 1
		
	driver.switch_to.window(driver.window_handles[nextIndex])

So each tab will be active for a few seconds before the script switches to the next one. After 5 minutes, the loop stops.
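The wrap-around expression used in the loop above (0 if nextIndex + 1 > 2 else nextIndex + 1) can also be written with the modulo operator, which keeps working if more tabs are added later. Below is a minimal sketch of that idea; next_tab_index is a hypothetical helper name, and in the real script the tab count would come from len(driver.window_handles).

```python
def next_tab_index(current_index, tab_count):
    """Return the index of the tab after current_index, wrapping to 0 at the end."""
    if tab_count <= 0:
        raise ValueError("tab_count must be positive")
    return (current_index + 1) % tab_count


# Simulate five switches across three tabs, starting from the last tab (index 2).
index = 2
visited = []
for _ in range(5):
    index = next_tab_index(index, 3)
    visited.append(index)

print(visited)
```

With this helper, the three identical nextIndex lines in the loop could be replaced by a single call after the if/elif chain.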

If we wish to close all tabs at the end of the script, we can perform the below:

for window in driver.window_handles:
	driver.switch_to.window(window)
	driver.close()

So that’s it, and congratulations on completing a new automation project to auto switch browser tabs for Chrome. As always, any comments or questions are welcome.


How to send email with attachment via python smtplib

In one of my previous articles, I discussed how to send email from the outlook application. That approach assumes you have already installed outlook and configured your email account on the machine where you want to run your script. In this article, I will share how to automatically send email with attachments via a lower level API, to be more specific, by using python smtplib, where you do not need to set up anything in your environment to make it work.

For this article, I will demonstrate how to send an HTML format email with an attachment from a gmail account. So besides the smtplib module, we will need to use another two modules – ssl and email.

Let’s get started!

First, you will need to find out the SMTP server and port info for sending email via a google account. You can find this information from this link. For easy reading, I have captured it in the screenshot below.

codeforests - google smtp server configuration info

So we are going to use the server smtp.gmail.com and port 587 for our case. (You may search online to find out more about SSL & TLS, we will not discuss them much in this article.)

Let’s start to import all the modules we need:

import smtplib, ssl
from email.mime.multipart import MIMEMultipart 
from email.mime.text import MIMEText 
from email.mime.application import MIMEApplication

As we are going to send the email in HTML format (which unlocks a lot of features such as adding styles, drawing tables etc.), we will need to use MIMEText. We also need MIMEMultipart and MIMEApplication for the attachment.

Build up the email message

To build up our email message, we need to create a mixed type MIMEMultipart object so that we can send both text and attachments. Next, we specify the from, to, cc and subject attributes.

smtp_server = 'smtp.gmail.com'
smtp_port = 587 
#Replace with your own gmail account
gmail = 'yourmail@gmail.com'
password = 'your password'

message = MIMEMultipart('mixed')
message['From'] = 'Contact <{sender}>'.format(sender = gmail)
message['To'] = 'contact@company.com'
message['CC'] = 'contact@company.com'
message['Subject'] = 'Hello'

You probably do not want anybody to see your hard coded password here, so you may consider putting this email account info into a separate configuration file. Check my other post on reading and writing configuration files.
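As a rough illustration of keeping the account info out of the script, the sketch below reads it with the standard configparser module. The section name gmail, the key names and the load_credentials helper are all assumptions made for this example; the configuration text is embedded as a string only so that the snippet runs on its own, whereas in practice it would sit in a separate file loaded with parser.read().

```python
import configparser

# Hypothetical configuration content; normally stored in a separate file
# (for example "email.ini") that is not committed to source control.
SAMPLE_CONFIG = """
[gmail]
address = yourmail@gmail.com
password = your password
"""


def load_credentials(config_text, section="gmail"):
    """Parse the configuration text and return (address, password)."""
    # interpolation=None lets the password contain '%' characters safely.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    return parser[section]["address"], parser[section]["password"]


gmail, password = load_credentials(SAMPLE_CONFIG)
print(gmail)
```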

For the HTML message content, we will wrap it into the MIMEText, and then attach it to our MIMEMultipart message:

msg_content = '<h4>Hi There,<br> This is a testing message.</h4>\n'
body = MIMEText(msg_content, 'html')
message.attach(body)

Let’s assume you want to attach a pdf file from your c drive. You can read it in binary mode and pass it into MIMEApplication with the MIME subtype set to pdf. Take note of the additional header, where you need to specify the name of your attachment file.

attachmentPath = "c:\\sample.pdf"
try:
	with open(attachmentPath, "rb") as attachment:
		p = MIMEApplication(attachment.read(),_subtype="pdf")	
		p.add_header('Content-Disposition', "attachment; filename= %s" % attachmentPath.split("\\")[-1]) 
		message.attach(p)
except Exception as e:
	print(str(e))

If you have a list of the attachments, you can loop through the list and attach them one by one with the above code.
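The loop over a list of attachments could look something like the sketch below. The attach_files helper is a hypothetical name, it uses the generic octet-stream subtype rather than pdf so that any file type fits, and the demo writes a temporary file purely so the snippet can run anywhere.

```python
import os
import tempfile
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart


def attach_files(message, paths):
    """Attach every readable file in paths; return the paths that failed."""
    failed = []
    for path in paths:
        try:
            with open(path, "rb") as fh:
                part = MIMEApplication(fh.read(), _subtype="octet-stream")
        except OSError:
            failed.append(path)
            continue
        # os.path.basename works for both Windows and Unix style paths
        part.add_header("Content-Disposition", "attachment",
                        filename=os.path.basename(path))
        message.attach(part)
    return failed


# Demo: one existing file and one missing file
demo_message = MIMEMultipart("mixed")
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.pdf")
    with open(sample, "wb") as fh:
        fh.write(b"%PDF-1.4 demo")
    missing = os.path.join(tmp, "missing.pdf")
    failed_paths = attach_files(demo_message, [sample, missing])

attached_names = [part.get_filename() for part in demo_message.get_payload()]
print(attached_names, len(failed_paths))
```

Returning the failed paths instead of only printing the exception lets the caller decide whether the email should still go out.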

Once everything is set properly, we can convert the message object into a string:

msg_full = message.as_string()

Send email

Here comes the most important part: we need to create the TLS context and use it to communicate with the SMTP server.

context = ssl.create_default_context()

Next, we initialize the connection with the SMTP server, pass in the TLS context and start the handshaking process.

Then we authenticate our gmail account. In the sendmail method, you specify the sender, the recipients (to and cc combined into one list) and the message string. (cc is optional)

#reuse the recipients from the message headers set earlier
to = message['To']
cc = message['CC']

with smtplib.SMTP(smtp_server, smtp_port) as server:
	server.ehlo()
	server.starttls(context=context)
	server.ehlo()
	server.login(gmail, password)
	server.sendmail(gmail,
				to.split(";") + (cc.split(";") if cc else []),
				msg_full)
	server.quit()

print("email sent out successfully")

Once sendmail has completed, server.quit() disconnects from the server.

With all the above, you should be able to receive the email triggered from your code. You may want to wrap this code into a class, so that you can reuse it as a service library across multiple projects.

 

As always, please share if you have any questions or comments.


How to print colored message on command line terminal window

When you are developing a python script that prints output messages on the terminal window, you may find it a little boring that all the messages are printed in black and white, especially if some messages are meant as warnings and some are just for information. You may wonder how to print colored messages to make them look different, so that your users pay special attention to those warning or error messages.

In this article, I will share a library which allows you to print colored messages in your terminal.

Let’s get started!

The library I am going to introduce is called colorama, a small and clean library for styling your messages on Windows, Linux and macOS.

Prerequisite:

You will need to install this library in order to run the code in this article.

pip install colorama

To start using this library, you need to import the modules and call the init() method at the beginning of your script or in your class initialization method.

import colorama
from colorama import Fore, Back, Style
colorama.init()

Print colored message with colorama

The init method also accepts some **kwargs to override its default behaviors. E.g. by default, the style is not reset after printing a message, so subsequent messages follow the same style. You can pass autoreset=True to the init method, so that the style is reset after each print statement.

Below are the options you can use when formatting the font, background and style.

Fore: BLACK, RED, GREEN, YELLOW, BLUE, MAGENTA, CYAN, WHITE, RESET.
Back: BLACK, RED, GREEN, YELLOW, BLUE, MAGENTA, CYAN, WHITE, RESET.
Style: DIM, NORMAL, BRIGHT, RESET_ALL

To use them in your message, you can wrap the message with the styles as below:

print(Fore.CYAN + "Cyan messages will be printed out just for info only" + Style.RESET_ALL)
print(Fore.RED + "Red messages are meant to be warnings or errors" + Style.RESET_ALL)
print(Fore.YELLOW + Back.GREEN +  "Yellow messages are debugging info" + Style.RESET_ALL)

This is how it looks in your terminal:

Python printed colored message with colorama

As I mentioned earlier, if you don’t set autoreset to True, you need to reset the style at the end of each message, so that different messages can apply different styles.

What if you want to apply the styles when asking for the user’s input? Let’s see an example:

print(Fore.YELLOW)
choice = input("Enter YES to confirm:")
print(Style.RESET_ALL)
if choice.upper() in ["YES", "Y"]:
    print(Fore.GREEN + "You have just confirmed to proceed." + Style.RESET_ALL)
else:
    print(Fore.RED + "You did not enter yes, let's stop here" + Style.RESET_ALL)

Because the input call sits between Fore.YELLOW and Style.RESET_ALL, both the prompt printed by your script and the text the user types are shown in the same style.

Let’s put all the above into a script and run it in the terminal to check how it looks.

Python printed colored message with colorama

Yes, that’s exactly what we want to achieve! Now you can wrap your print statement into a method, e.g. print_colored_message, so that you do not need to repeat the code everywhere.

As always, please share if you have any comments or questions.
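One possible shape for such a print_colored_message helper is sketched below. The split into format_colored_message and print_colored_message is just an assumption for illustration, and the snippet falls back to the raw ANSI escape codes (the same sequences colorama's Fore and Style constants hold) when colorama is not installed, so that it can run on its own.

```python
try:
    from colorama import Fore, Style
    RED, GREEN, RESET = Fore.RED, Fore.GREEN, Style.RESET_ALL
except ImportError:
    # Fallback: plain ANSI escape sequences for red, green and reset
    RED, GREEN, RESET = "\033[31m", "\033[32m", "\033[0m"


def format_colored_message(text, color):
    """Wrap text in the given color code and always reset the style afterwards."""
    return f"{color}{text}{RESET}"


def print_colored_message(text, color):
    """Print text in the given color without affecting later output."""
    print(format_colored_message(text, color))


print_colored_message("You have just confirmed to proceed.", GREEN)
print_colored_message("You did not enter yes, let's stop here", RED)
```

Keeping the formatting in its own function makes the helper easy to check without capturing terminal output.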

 


Python how to unpack tuple, list and dictionary

There are various cases where you want to unpack python objects such as a tuple, list or dictionary into individual variables, so that you can easily access the individual items. In this article, I will share how to unpack these different python objects and how this is useful when working with *args and **kwargs in a function.

Let’s get started.

Unpack python tuple objects

Let’s say we have a tuple object called shape which describes the height, width and channel of an image. We can unpack it into 3 separate variables as below:

shape = (500, 300, 3)
height, width, channel = shape
print(height, width, channel)

And you can see each item inside the tuple has been assigned to the individual variables with a meaningful name, which increases the readability of your code. Below is the output:

500 300 3

It’s definitely more elegant than accessing each item by index, e.g. shape[0], shape[1], shape[2].

What if we just need to access a few items in a big tuple which has many items? Here we need to introduce _ (a conventional name for a throwaway variable) and * (which collects an arbitrary number of items).

For example, if we just want to extract the first and the last item from the tuple below, we can let the rest of the items go into the throwaway variable.

toto_result = (4,11,14,23,28,47,24)
first, *_, last = toto_result
print(first, last)

So the above will give the below output:

4 24

If you are curious about what is inside “_”, you can try to print it out, and you will see it is actually a list of the items between the first and the last item.

[11, 14, 23, 28, 47]
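The starred name does not have to sit in the middle. A small sketch with a few more arrangements (the variable names here are only for illustration):

```python
toto_result = (4, 11, 14, 23, 28, 47, 24)

# Keep the first two items by name and collect everything else.
first, second, *others = toto_result

# The starred name can also come first; it still receives a list.
*leading, last = toto_result

# When the plain names use up all the items, the starred name gets an empty list.
a, *middle, b = (1, 2)

print(first, second, others)
print(leading, last)
print(a, middle, b)
```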

The most popular use case of packing and unpacking is passing parameters to a function which accepts an arbitrary number of arguments (*args). Let’s look at an example:

def sum(*numbers):
    total = 0
    for n in numbers:
        total += n
    return total

The above sum function accepts any number of arguments and sums up the values (note that it shadows Python’s built-in sum within this example). The * here packs all the arguments passed to this function into a tuple called numbers. If you want to sum up the values of all the items in toto_result, passing in toto_result directly would not work.

toto_result = (4,11,14,23,28,47,24)
#sum(toto_result) would raise TypeError

So what we can do is to unpack the items from the tuple and then pass them to the sum function:

total = sum(*toto_result)
print(total)
#output should be 151

Unpack python list objects

Unpacking a list object is similar to unpacking a tuple object. If we replace the tuple with a list in the above examples, they work in exactly the same way.

shape = [500, 300, 3]
height, width, channel = shape
print(height, width, channel)
#output shall be 500 300 3

toto_result = [4,11,14,23,28,47,24]
first, *_, last = toto_result
print(first, last)
#output shall be 4 24

total = sum(*toto_result) 
print(total) 
#output should be also 151

Unpack python dictionary objects

Unlike a list or tuple, unpacking a dictionary is probably only useful when you want to pass the dictionary as keyword arguments into a function (**kwargs).

For instance, with the function below, you can pass in all your keyword arguments one by one.

def print_header(**headers):
    for header in headers:
        print(header, headers[header])

print_header(Host="www.codeforests.com", referer = "https://www.codeforests.com")

Or if you have a dictionary like below, you can just unpack it and pass it to the function:

headers = {'Host': 'www.codeforests.com', 'referer' : 'https://www.codeforests.com'}
print_header(**headers)

It will generate the same result as previously, but the code is more concise.

Host www.codeforests.com
referer https://www.codeforests.com

With this unpacking operator, you can also combine multiple dictionaries as per below:

headers = {'Host': 'www.codeforests.com', 'referer' : 'https://www.codeforests.com'}
extra_header = {'user-agent': 'Mozilla/5.0'}

new_header = {**headers, **extra_header}

The new_header dictionary will look like below:

{'Host': 'www.codeforests.com',
 'referer': 'https://www.codeforests.com',
 'user-agent': 'Mozilla/5.0'}
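When the dictionaries combined this way share a key, the value from the dictionary unpacked last wins. A short sketch with made-up header values:

```python
defaults = {"Host": "www.codeforests.com", "user-agent": "python-requests"}
overrides = {"user-agent": "Mozilla/5.0"}

# Later unpacking overrides earlier keys; the originals are left untouched.
merged = {**defaults, **overrides}
reverse_merged = {**overrides, **defaults}

print(merged)
print(reverse_merged)
```

So the order of the unpacked dictionaries matters whenever a default value should be overridden.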

Conclusion

The unpacking operation is very useful, especially when dealing with *args and **kwargs. One thing worth noting is the throwaway variable (_) mentioned earlier. Please use it with caution: as you may have noticed, the Python interactive interpreter also uses _ to store the result of the last evaluated expression. So do take note of this potential conflict. See the example below:

codeforests interactive interpreter conflicts

As always, any comments or questions are welcome.

Get file names by extension from a directory

Whenever you work with directories and files, you will probably need to implement some function to get file names by file extension from a particular directory. For instance, you may want to check and process all the excel files in a folder, or do some housekeeping to remove all the old log files. In this article, I will explain a few ways of implementing such a function.

Let’s get started!

There are actually plenty of libraries/modules you can use to achieve this, but let’s start with the most commonly used ones.

Option 1

Since you will need to import the os module anyway if you handle file operations, you can make use of the functions from this module.

For instance, you can list all the files/sub-directories under the current directory, and check if each file name ends with a certain file extension as below:

import os

pyfiles = []
for file in os.listdir("."):
    if file.lower().endswith(".ipynb"):
        pyfiles.append(file)

You can further sort the files by last modified time from latest to the earliest.

pyfiles.sort(key=os.path.getmtime, reverse=True)

What if you want to check multiple file extensions? Don’t worry, you can still achieve it with a minor change to the if condition, since endswith also accepts a tuple of suffixes:

if file.lower().endswith((".ipynb", ".xlsx")):

Option 2

The os module also has another method, scandir, which achieves the same and also returns the file type and file attribute info for each entry.

files = []
for file in os.scandir("."):
    if file.name.lower().endswith((".ipynb", ".xlsx")):
        files.append(file.name)

 

Option 3

If you don’t like the way the file names are matched in the above code, you can use fnmatch to do this job. For example:

import fnmatch
files = []
for file in os.listdir("."):
    if fnmatch.fnmatch(file, "*.ipynb") or fnmatch.fnmatch(file, "*.xlsx"):
        files.append(file)

 

Option 4

Python also has a glob module, which lets you use Unix style patterns to match files. To match the files with a certain extension, you can simply do the below:

import glob
files = glob.glob("*.ipynb")

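And then sort by the file creation time from the latest to the earliest (note that on Unix systems os.path.getctime returns the time of the last metadata change rather than the creation time):

files.sort(key=os.path.getctime, reverse=True)

If you want to match multiple file extensions, you can do something like below: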

files = []
file_types = ("*.ipynb", "*.xlsx")
for file_type in file_types:
    files.extend(glob.glob(file_type))

files.sort(key=os.path.getctime, reverse=True)
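One more variation, using the standard pathlib module, is sketched below. The files_with_extensions helper is a hypothetical name, and the demo creates a few empty files in a temporary folder only so that the snippet can run anywhere.

```python
import tempfile
from pathlib import Path


def files_with_extensions(directory, extensions):
    """Return sorted names of regular files whose suffix is in extensions (case-insensitive)."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        entry.name
        for entry in Path(directory).iterdir()
        if entry.is_file() and entry.suffix.lower() in wanted
    )


with tempfile.TemporaryDirectory() as tmp:
    for name in ("report.XLSX", "notes.ipynb", "readme.txt"):
        (Path(tmp) / name).touch()
    matched = files_with_extensions(tmp, {".ipynb", ".xlsx"})

print(matched)
```

Unlike the listdir examples, the is_file() check here also skips sub-directories whose names happen to end with one of the extensions.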

As I mentioned earlier, there are far more ways of doing it and it would not be possible to list all of them, so I will just stop here. Please leave your comments if you have better ideas.

 

How to swap key and value in a python dictionary

There are cases where you may want to swap the key and value pairs in a python dictionary, so that you can look things up by the unique values of the original dictionary.

For instance, if you have the below dictionary:

contact = {"joe" : "contact@company.com", "john": "john@company.com"}

You can swap the keys and values of the dictionary by:

contact = {val : key for key, val in contact.items()}
print(contact)

You will see the below output:

{'contact@company.com': 'joe', 'john@company.com': 'john'}

But if multiple names share the same email address in the above dictionary, only one name will be retained, e.g.:

contact = {"joe" : "contact@company.com", "jane" : "contact@company.com", "john": "john@company.com"}
contact = {val : key for key, val in contact.items()}

Output of the contact dictionary will be :

{'contact@company.com': 'jane', 'john@company.com': 'john'}

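So how do we keep all the keys that have the same value after reversing it?

You will need to use a list or set to collect all the keys that share the same value. Starting again from the original (un-swapped) contact dictionary:

contact = {"joe" : "contact@company.com", "jane" : "contact@company.com", "john": "john@company.com"}
email_contact = {}
for key, val in contact.items():
    email_contact.setdefault(val, []).append(key)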

(please refer to this article about the setdefault method)

And you will see the below output for the new dictionary email_contact:

{'contact@company.com': ['joe', 'jane'], 'john@company.com': ['john']}
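The same grouping can also be written with collections.defaultdict from the standard library, which creates the empty list automatically the first time a key is seen. A minimal sketch using the same sample data:

```python
from collections import defaultdict

contact = {
    "joe": "contact@company.com",
    "jane": "contact@company.com",
    "john": "john@company.com",
}

# Group the names by email address; missing keys start as an empty list.
email_contact = defaultdict(list)
for name, email in contact.items():
    email_contact[email].append(name)

# Convert back to a plain dict so later lookups of unknown keys raise KeyError.
email_contact = dict(email_contact)
print(email_contact)
```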

That’s exactly what we want! Now we can say “hi” to both Joe and Jane when sending email to contact@company.com, without missing any names.

 

As always, any comments or questions are welcome.


Handling the KeyError for python dictionary


The KeyError is quite commonly seen when dealing with dictionary objects. It is raised when you try to access a key that does not exist in the dictionary. Usually, to avoid this error, we need to check whether the key exists before accessing its value.

For instance, you can check if the key “country” exists in my_dict and then check if the value is “SGP” like the below. But the code does not look elegant.

my_dict = {"name" : "National University of Singapore", "address" : "21 Lower Kent Ridge Rd Singapore", "contact": "68741616"}
if my_dict.get("country") and my_dict["country"] == "SGP":
    print(f"country code is {my_dict['country']}")

You may also see people use the approach below to make the code more concise, passing in a default value to be returned if the key does not exist:

if my_dict.get("country", "") == "SGP":
    print(f"country code is {my_dict['country']}")

The Zen of Python tells us

Explicit is better than implicit.

So the above code does not really follow this principle. If you go through the python documentation for dictionary, there is indeed a way to get the value of a key and at the same time set a default value if the key is new to the dictionary. The code below shows how it works:

if my_dict.setdefault("country", "") == "SGP":
    print(f"country code is {my_dict['country']}")

By doing the above, the key “country” will be added into my_dict with the default value if it did not exist previously, and the value of this key is then returned.

To extend the above setdefault usage, if the value is a list of objects, you can also use this method to initialize the list and then add items to it.
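The difference between get and setdefault is easiest to see side by side: get leaves the dictionary untouched, while setdefault inserts the missing key. A small sketch (the variable names are only for illustration):

```python
my_dict = {"name": "National University of Singapore"}

# get returns the default but does not modify the dictionary
value_from_get = my_dict.get("country", "")
has_key_after_get = "country" in my_dict

# setdefault returns the default and also stores it under the missing key
value_from_setdefault = my_dict.setdefault("country", "")
has_key_after_setdefault = "country" in my_dict

# for a key that already exists, setdefault returns the stored value unchanged
existing = my_dict.setdefault("name", "ignored")

print(has_key_after_get, has_key_after_setdefault, existing)
```

So setdefault is a good fit when the dictionary is meant to grow, and get is the safer choice when it should stay as it is.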

my_dict.setdefault("faculty", []) # use list or set()
my_dict["faculty"].append("Arts")
my_dict["faculty"].append("Computer Science")

 

As always, any comments or questions are welcome.

 


How to send email from outlook in python

In the previous article, I explained how to read emails and save attachments from outlook by using the pywin32 library. In this article, I will walk you through how to send email from outlook with the same library.

Prerequisite:

You need to install the pywin32 library in your working environment.

pip install pywin32

and import this library in your script.

import win32com.client

Let’s get started!

You will first need to initiate the outlook application by calling the below:

outlook = win32com.client.Dispatch('outlook.application')

In outlook, emails, meeting invites, calendars, appointments etc. are all considered Item objects. Hence we can use the below to create an email object (0 is the value of the olMailItem constant):

mail = outlook.CreateItem(0)

For this mail item, there are various attributes we can set, such as To, CC, BCC, Subject, Body and HTMLBody, as well as the Attachments:

mail.To = 'contact@company.com'
mail.Subject = 'Sample Email'
mail.HTMLBody = '<h3>This is HTML Body</h3>'
mail.Body = "This is the normal body"
mail.Attachments.Add('c:\\sample.xlsx')
mail.Attachments.Add('c:\\sample2.xlsx')
mail.CC = 'somebody@company.com'

You can add multiple attachments by calling Attachments.Add multiple times.

Trigger to send out email from outlook

With the above attributes set, you are able to send out the email since all the necessary info is provided. The line of code below triggers outlook to send the email.

mail.Send()

You may also wonder what to do if you just want to reply to a particular email instead of writing a new one. In this case, you will need to find the email message first and then use message.Reply() or message.ReplyAll() to reply to the original message. Do check this article of mine.

Conclusion:

This is just a simple demo of how to send emails, and there are plenty of other things you can do with the pywin32 library. Do check my other related articles, such as this.

Last but not least, any comments or questions are welcome. Follow me on twitter for more updates.

Fix the CompDocError when reading excel file with xlrd

CompDocError

You may have seen this CompDocError before if you used the python xlrd library to read an excel file in the older format (.xls). When you open the same file directly in Microsoft Excel, it shows the data properly without any issue.

This usually happens when the excel file is generated by a 3rd party application that did not strictly follow the Microsoft Excel standard format. Such a file is readable by Excel, but opening it with the xlrd library fails due to the non-standard format or some missing metadata. As you may have no control over how the 3rd party application generates the file, you will need to find a way to handle this CompDocError in your code.

 

SOLUTIONS FOR COMPDOCERROR

 

Option 1:

If you look at the error message, the error is raised from line 427 of compdoc.py in your xlrd package. Once you have confirmed there is no problem with the data in your excel file apart from the minor format issue, you can open compdoc.py and comment out the lines that raise the CompDocError exception.

while s >= 0:
    if self.seen[s]:
        pass
        #print("_locate_stream(%s): seen" % qname, file=self.logfile); dump_list(self.seen, 20, self.logfile)
        #raise CompDocError("%s corruption: seen[%d] == %d" % (qname, s, self.seen[s]))

Option 2:

You may notice that if you open your file in Microsoft Excel and save it, xlrd can then read it without raising any exception. This is because Excel has already fixed the issues for you when saving the file. You can use the same approach in your code to fix this problem.

To do that, you can use the pywin32 library to open the native Excel application and re-save the file.

 

import os
import win32com.client as win32

excel_app = win32.Dispatch('Excel.Application')
#Excel resolves relative paths against its own working directory, so pass an absolute path
wb = excel_app.Workbooks.Open(os.path.abspath("test.xls"))
excel_app.DisplayAlerts = False #do not show any alert when closing the excel
wb.Save()
excel_app.Quit()

 

Conclusion

 

For option 1, it is good if your program only reads files generated from the same source. If your program needs to read different excel files from different sources, it may not be a good idea to always assume the “CompDocError” can be ignored.

 

For option 2, when calling excel_app.Quit(), the entire Excel application will be closed without any alert. If you have other excel files open at the time, they will all be closed together. So this solution is good if your program runs in a standalone environment, or if you have confirmed that no other process or person will be using excel when your code runs.

 

If you would like to understand more about how to read & write excel file with xlrd, please check this article.