I ran across this neat little regex trick the other day. My goal was to match three kinds of lines with one regex and extract all of the data from each line. The lines are formatted as follows:
4 STRETCH    4    1              2.0586112     1.0893702
5 BEND       4    1    2         1.9052943   109.1653223
6 TORSION    4    1    2    3    3.1415927   180.0000000
The first column is the coordinate number (a positive integer). The second column is the coordinate type (a string). The third through sixth columns are atom numbers (positive integers); a stretch line uses two of them, a bend line three, and a torsion line all four. The seventh column is the coordinate value in either bohrs or radians (floating point). The eighth column is the coordinate value in either angstroms or degrees (floating point).
"(\d+) +([A-Z]+) +(\d+) +(\d+) +(\d+)? +(\d+)? +(-?\d+\.\d+) +(-?\d+\.\d+)"
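This regex is built from three kinds of pieces, each wrapped in a capturing group and separated by one or more spaces:
- \d+ matches one or more digits, i.e. an unsigned integer
- [A-Z]+ matches any string of all capital letters
- -?\d+\.\d+ matches a decimal number with an optional leading minus sign (exponent notation such as 1.0e-5 is not covered)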
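As a rough sketch of how those three pieces fit together, the snippet below rebuilds the same pattern from named fragments. The names INT, WORD, FLOAT and the helper group() are my own inventions for illustration, not part of the original script.

```python
import re

# The three building blocks from the list above.
INT = r"\d+"            # one or more digits
WORD = r"[A-Z]+"        # one or more capital letters
FLOAT = r"-?\d+\.\d+"   # optional minus sign, digits, a dot, digits


def group(piece, optional=False):
    """Hypothetical helper: wrap a piece in a capturing group, add '?' if optional."""
    return "(" + piece + ")" + ("?" if optional else "")


columns = [
    group(INT),                 # column 1: coordinate number
    group(WORD),                # column 2: coordinate type
    group(INT),                 # column 3: first atom
    group(INT),                 # column 4: second atom
    group(INT, optional=True),  # column 5: third atom (bend, torsion)
    group(INT, optional=True),  # column 6: fourth atom (torsion only)
    group(FLOAT),               # column 7: value in bohrs or radians
    group(FLOAT),               # column 8: value in angstroms or degrees
]

# Columns are separated by " +", i.e. one or more spaces.
LONG_REGEX = " +".join(columns)

torsion = "6 TORSION 4 1 2 3 3.1415927 180.0000000"
print(LONG_REGEX)
print(re.search(LONG_REGEX, torsion).groups())
```

Building the pattern this way is purely a readability choice; the joined string is character-for-character the same regex as the one-liner.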
import re

regex_groups = re.search(LONG_REGEX, line.strip())
if regex_groups:
    regex_groups = list(regex_groups.groups())
    if regex_groups[5]:
        pass  # This is a torsion line (four atom columns)
    elif regex_groups[4]:
        pass  # This is a bend line (three atom columns)
    else:
        pass  # This is a stretch line (two atom columns)
else:
    pass  # This is some other line
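The fifth and sixth columns are made optional in the regex with the ? character, so their groups come back as None when a line has fewer atoms. Every " +" separator is still required, though, so a line with missing atom columns only matches when the blank columns are padded with spaces, as in fixed-width output. The result was the ability to recognize and parse the contents of three similar but distinct types of lines. I was also able to set up syntax highlighting for the blog.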
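To make the parsing step concrete, here is a minimal sketch that wraps the same regex in a function and converts the captured strings to numbers. The function name parse_line, the dictionary keys, and the exact column spacing of the sample lines are assumptions of mine; the original script may well look different.

```python
import re

LONG_REGEX = re.compile(
    r"(\d+) +([A-Z]+) +(\d+) +(\d+) +(\d+)? +(\d+)? +(-?\d+\.\d+) +(-?\d+\.\d+)"
)


def parse_line(line):
    """Return a dict for a coordinate line, or None for any other line."""
    match = LONG_REGEX.search(line.strip())
    if match is None:
        return None
    number, kind, a1, a2, a3, a4, first, second = match.groups()
    # Optional groups that did not take part in the match are None.
    atoms = [int(a) for a in (a1, a2, a3, a4) if a is not None]
    return {
        "number": int(number),
        "kind": kind,
        "atoms": atoms,  # 2 atoms = stretch, 3 = bend, 4 = torsion
        "values": (float(first), float(second)),
    }


# Sample lines with padded columns; the exact spacing is an assumption.
SAMPLE = """\
4 STRETCH    4    1              2.0586112     1.0893702
5 BEND       4    1    2         1.9052943   109.1653223
6 TORSION    4    1    2    3    3.1415927   180.0000000
some unrelated line
"""

for line in SAMPLE.splitlines():
    print(parse_line(line))
```

The number of entries in "atoms" tells the three line types apart, which mirrors the check on groups five and six. With every column squeezed down to a single space, the stretch and bend lines would no longer match this pattern, so the sketch only suits padded, fixed-width input.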