Regex in Python

I ran across this neat little regex trick the other day. My goal was to match three kinds of lines with one regex and extract all of the data from each line. The lines are formatted as follows:

4 STRETCH   4  1                     2.0586112       1.0893702
5 BEND      4  1  2                  1.9052943     109.1653223
6 TORSION   4  1  2  3               3.1415927     180.0000000

The first column is the coordinate number (positive integer). The second column is the coordinate type (string). The third through sixth columns are atom numbers (positive integers); a stretch has two, a bend has three, and a torsion has four. The seventh column is the coordinate value in either bohrs or radians (floating point). The eighth column is the coordinate value in either angstroms or degrees (floating point).

LONG_REGEX = r"(\d+) +([A-Z]+) +(\d+) +(\d+) +(\d+)? +(\d+)? +(-?\d+\.\d+) +(-?\d+\.\d+)"

This regex is built from three distinct pieces:

  1. \d+ matches any unsigned integer (one or more digits)
  2. [A-Z]+ matches any string of capital letters
  3. -?\d+\.\d+ matches any floating-point number written with a decimal point, with an optional leading minus sign
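
As a quick sanity check of the three pieces above, here is a small sketch that tests each one on its own with re.fullmatch. The names PIECES and matches, and the sample strings, are made up for illustration and are not part of my original script.

```python
import re

# Hypothetical names for the three building blocks described above.
PIECES = {
    "integer": r"\d+",
    "capitals": r"[A-Z]+",
    "float": r"-?\d+\.\d+",
}

def matches(piece, text):
    """Return True if the whole of text matches the named piece."""
    return re.fullmatch(PIECES[piece], text) is not None

print(matches("integer", "42"))          # True
print(matches("capitals", "TORSION"))    # True
print(matches("float", "109.1653223"))   # True
print(matches("float", "180"))           # False: no decimal point
```

Note that the float piece needs digits on both sides of the decimal point, so it would not match something like 180 or 1e-5. That is fine for this file format, where every value is printed with a decimal point.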

import re

# LONG_REGEX is the pattern shown above; line is one line of the input file
match = re.search(LONG_REGEX, line.strip())
if match:
    groups = match.groups()  # unmatched optional groups come back as None
    if groups[5]:
        kind = "torsion"  # four atom numbers
    elif groups[4]:
        kind = "bend"  # three atom numbers
    else:
        kind = "stretch"  # only the two required atom numbers
else:
    kind = None  # some other line

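The fifth and sixth columns are made optional in the regex with the ? character, so a missing atom number comes back as None. Each " +" separator still needs at least one space, which works here because the columns are fixed-width and a missing atom number leaves blank space behind. The result is the ability to recognize and parse the contents of three similar but distinct types of lines. I was also able to set up syntax highlighting for the blog.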
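
To see the whole thing end to end, here is a minimal sketch that runs the regex over the three sample lines and converts the captured strings to numbers. The helper parse_line, the list SAMPLE_LINES, and the dictionary layout are my own illustrative choices, not something the regex itself requires.

```python
import re

LONG_REGEX = r"(\d+) +([A-Z]+) +(\d+) +(\d+) +(\d+)? +(\d+)? +(-?\d+\.\d+) +(-?\d+\.\d+)"

SAMPLE_LINES = [
    "4 STRETCH   4  1                     2.0586112       1.0893702",
    "5 BEND      4  1  2                  1.9052943     109.1653223",
    "6 TORSION   4  1  2  3               3.1415927     180.0000000",
]

def parse_line(line):
    """Hypothetical helper: return a dict of typed fields, or None if no match."""
    match = re.search(LONG_REGEX, line.strip())
    if match is None:
        return None
    # groups() yields 8 items: number, type, 4 atom slots, 2 values
    number, kind, *rest = match.groups()
    # Optional atom slots that did not match are None, so drop them
    atoms = [int(a) for a in rest[:4] if a is not None]
    return {
        "number": int(number),
        "type": kind,
        "atoms": atoms,
        "values": (float(rest[4]), float(rest[5])),
    }

for line in SAMPLE_LINES:
    print(parse_line(line))
```

The number of atoms recovered tells the line types apart: two for a stretch, three for a bend, and four for a torsion.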
