Python split text with multiple delimiters

There are cases where you want to split text that may use different symbols (delimiters) to separate its elements. For instance, if the text is in CSV or TSV format, the fields are separated by a comma (,) or a tab (\t), and your code needs to support both delimiters. In this article, I will share a few possible ways to split text with multiple delimiters in Python.

Checking if certain delimiter exists before splitting

If you are pretty sure the text will only contain one type of delimiter at a time, you can check whether that delimiter exists before splitting, e.g.:

text = 'field1,field2,field3,field4'
#or 
text = 'field1;field2;field3;field4'

You can write a one-liner that splits by comma if a comma is present, and by semicolon otherwise.

text.split(",") if "," in text else text.split(";")
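The same check can be extended to a short list of candidate delimiters. The following is only a minimal sketch: the helper name split_on_first_found and its candidates parameter are made up for illustration, and it still assumes the text contains just one kind of delimiter.

```python
def split_on_first_found(text, candidates=(",", ";", "\t")):
    """Split on the first candidate delimiter that appears in text.

    Hypothetical helper: candidates are tried in the order given.
    """
    for delim in candidates:
        if delim in text:
            return text.split(delim)
    # No candidate delimiter found: return the text as a single field.
    return [text]


print(split_on_first_found("field1;field2;field3;field4"))
# ['field1', 'field2', 'field3', 'field4']
```

Because only the first delimiter found is used, text that mixes delimiters would be split incompletely with this approach.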

But if many different delimiters can appear in the text, or different delimiters can be mixed in the same text, writing the above if-else logic becomes very tedious. You might have thought about using the string replace function to convert all the different delimiters into a single delimiter first. It may work for your case, but it is far from an elegant solution.
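For reference, the replace-based approach could look roughly like the sketch below. The function name split_by_replace and the choice of comma as the common delimiter are assumptions made for this example.

```python
def split_by_replace(text, delimiters=(",", ";", "\t"), target=","):
    """Rewrite every delimiter as `target`, then split once.

    Hypothetical helper: one replace call is needed per delimiter.
    """
    for delim in delimiters:
        text = text.replace(delim, target)
    return text.split(target)


print(split_by_replace("field1,field2;field3\tfield4"))
# ['field1', 'field2', 'field3', 'field4']
```

It handles mixed delimiters, but it scans the text once per delimiter and loses the information about which delimiter was originally used.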

For such cases, let’s move on to the second option.

Using re to split text with multiple delimiters

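The regular expression module (re) has a split function that splits a string by a pattern. You can list all the possible delimiters, separated by “|”, to split the text on any of them in one pass. Delimiters that are regex metacharacters (such as "." or "|" itself) need to be escaped, for example with re.escape.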

For instance, the code below extracts field1 to field5 into a list.

import re

text1 = "field1\tfield2,field3;field4 field5"
fields = re.split(r",|;|\s", text1)  # \s already matches the tab character

The result in fields will be a list with all the data fields we want:

['field1', 'field2', 'field3', 'field4', 'field5']
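One thing to watch for: when two delimiters sit next to each other (for example a comma followed by a space), a pattern that matches a single delimiter at a time produces empty strings in the result. A possible way around this, shown as a sketch with a made-up sample string text2, is to use a character class with the + quantifier so that a run of delimiters counts as one split point.

```python
import re

# Sample text (made up for this example) with adjacent delimiters.
text2 = "field1, field2;;field3\t field4"

# One delimiter per match: empty strings appear between neighbours.
naive = re.split(r",|;|\s", text2)
print(naive)
# ['field1', '', 'field2', '', 'field3', '', 'field4']

# A run of one or more delimiters is treated as a single split point.
collapsed = re.split(r"[,;\s]+", text2)
print(collapsed)
# ['field1', 'field2', 'field3', 'field4']
```

Whether collapsing is appropriate depends on the data: in a CSV-like file, two consecutive commas usually mean an empty field that should be kept.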

What if you also want to keep the delimiters in the list for later use (e.g. to rebuild the original text)? You can wrap the pattern in a capture group (), so that the matched delimiters also appear in the result.

fields = re.split(r'(,|;|\s)', text1)

Result of the fields variable:

['field1', '\t', 'field2', ',', 'field3', ';', 'field4', ' ', 'field5']
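As a quick illustration of why keeping the delimiters is useful, the sketch below rebuilds the original text and separates the values from the delimiters. The variable names parts, rebuilt, values and delims are chosen only for this example.

```python
import re

text1 = "field1\tfield2,field3;field4 field5"

# With a single capture group, the result alternates between
# field values (even indexes) and delimiters (odd indexes).
parts = re.split(r"(,|;|\s)", text1)

# Joining the pieces back together restores the original text.
rebuilt = "".join(parts)
print(rebuilt == text1)
# True

values = parts[0::2]   # every second item, starting with the first field
delims = parts[1::2]   # the delimiters in between
print(values)
# ['field1', 'field2', 'field3', 'field4', 'field5']
print(delims)
# ['\t', ',', ';', ' ']
```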

Conclusion

It is quite common to need code that splits text with multiple delimiters. There are possibly other ways to solve this problem, but so far re.split is still the most straightforward and efficient way.
