Regex in Python

I ran across this neat little regex trick the other day. My goal was to match three kinds of lines with one regex and extract all of the data from each line. The lines are formatted as follows:

4 STRETCH   4  1                     2.0586112       1.0893702
5 BEND      4  1  2                  1.9052943     109.1653223
6 TORSION   4  1  2  3               3.1415927     180.0000000

The first column is the coordinate number (positive integer). The second column is the coordinate type (string). The third through sixth columns are atom numbers (positive integers); a line has two, three, or four of them depending on its type. The seventh column is the coordinate value in either bohrs or radians (floating point). The eighth column is the coordinate value in either angstroms or degrees (floating point).

"(\d+) +([A-Z]+) +(\d+) +(\d+) +(\d+)? +(\d+)? +(-?\d+\.\d+) +(-?\d+\.\d+)"

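This regex is built from three kinds of pieces:

  1. \d+ matches one or more digits (an unsigned integer)
  2. [A-Z]+ matches a string of one or more capital letters
  3. -?\d+\.\d+ matches a decimal number with an optional minus sign

The captured groups then tell the three kinds of lines apart:

import re

LONG_REGEX = r"(\d+) +([A-Z]+) +(\d+) +(\d+) +(\d+)? +(\d+)? +(-?\d+\.\d+) +(-?\d+\.\d+)"

# line is one line of text read from the file being parsed
regex_match = re.search(LONG_REGEX, line.strip())
if regex_match:
	regex_groups = regex_match.groups()
	if regex_groups[5]:
		line_type = "torsion"  # all four atom numbers are present
	elif regex_groups[4]:
		line_type = "bend"  # three atom numbers are present
	else:
		line_type = "stretch"  # only the two required atom numbers
else:
	line_type = None  # this is some other line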

The fifth and sixth columns are made optional in the regex with the ? character, so a group that is missing from a line comes back as None. The result is the ability to recognize and parse the contents of three similar but distinct types of lines. I was also able to set up syntax highlighting for the blog.
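As a quick check of the trick, here is a minimal sketch that wraps the same pattern in a function and turns the captured text into numbers. The function name parse_coordinate and the dictionary keys are my own placeholders for illustration; only the regex itself comes from the post.

```python
import re

# Same pattern as in the post, compiled once for reuse.
LONG_REGEX = re.compile(
    r"(\d+) +([A-Z]+) +(\d+) +(\d+) +(\d+)? +(\d+)? +(-?\d+\.\d+) +(-?\d+\.\d+)"
)


def parse_coordinate(line):
    """Return a dict describing one coordinate line, or None if it does not match.

    The key names are hypothetical; pick whatever suits your script.
    """
    match = LONG_REGEX.search(line.strip())
    if match is None:
        return None
    number, kind, *rest = match.groups()
    # rest[:4] holds the atom columns; the optional ones are None when absent.
    atoms = [int(atom) for atom in rest[:4] if atom is not None]
    return {
        "number": int(number),
        "type": kind,
        "atoms": atoms,
        "value_bohr_or_rad": float(rest[4]),
        "value_angstrom_or_deg": float(rest[5]),
    }


sample_lines = [
    "4 STRETCH   4  1                     2.0586112       1.0893702",
    "5 BEND      4  1  2                  1.9052943     109.1653223",
    "6 TORSION   4  1  2  3               3.1415927     180.0000000",
]
for sample in sample_lines:
    print(parse_coordinate(sample))
```

The number of atoms in the result (two, three, or four) is what separates a stretch from a bend from a torsion. Note that the pattern relies on the wide gap of spaces before the first floating point value, as in the sample lines.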

Research

Every time I tell someone, "I am working as a full-time research assistant this summer," I get blank stares or awkward grins. I thought I should explain what I do for research. Molecules have energy. I am studying some exotic molecules, using computers to calculate their energies. I want to find what makes a molecule's energy a minimum and what makes it a maximum.

The majority of the work I do is preparing input files for the computer to perform calculations on. Once the computer completes the calculations (sometimes this takes over two hours), I examine the output to make sure that everything looks reasonable. Repeat this process several hundred times and I will hopefully have enough data for my professor to finish his paper.

While I am waiting for the computer to finish its calculations, I am writing a Python script to automate my job. The script will eventually save me 2 to 5 minutes for each file I make. It will also automatically submit files to the computer around the clock.

This job has been incredibly enjoyable. It is the perfect blend of Chemistry and Computer Science. It is just the right fit for me. So instead of giving me a blank stare or awkward grin you can now say with a broad smile, "I am glad you enjoy being a nerd and doing cool nerdy things!"

Regular Dilemmas

The most exciting part of trying something new is the unexpected. I started work on step four of my process (copy file0.inp to file1.inp) and encountered an unexpected puzzle: how to increment a number embedded in a string. As a human, I can recognize the three parts of the name file0.inp. Anyone can see that they are file, 0, and .inp. The trick was turning this intuition into an algorithm. I did some quick googling and found an algorithm that did exactly what I wanted. The algorithm relied on regular expression matching. I was familiar with the concept of regular expression matching as a super extension of scanf. Even though I had an algorithm that worked, I wanted to make sure I at least understood the basics of regular expressions. The long and short of it was that I spent two hours writing a "simple program" that copied one file to another. But I also learned basic Python regular expression syntax (more on that later). Now the next time I write code like this it won't even take me two minutes. I also see potential for using regular expressions on some of the other steps of the process.
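For the curious, here is a minimal sketch of the idea. This is not the algorithm I found online, just one way to do it under the assumption that the number sits directly before the file extension. The names increment_filename and copy_to_next are made up for this example.

```python
import re
import shutil

# Three groups: everything before the number, the number, and the extension.
NAME_PATTERN = re.compile(r"(.*?)(\d+)(\.[^.]+)")


def increment_filename(name):
    """Return name with the number before its extension increased by one."""
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"no number found before the extension in {name!r}")
    prefix, number, extension = match.groups()
    return f"{prefix}{int(number) + 1}{extension}"


def copy_to_next(name):
    """Copy a file to the next name in the sequence and return that name."""
    new_name = increment_filename(name)
    shutil.copyfile(name, new_name)
    return new_name


print(increment_filename("file0.inp"))
```

One design note: converting the digits with int drops any zero padding, so file09.inp would become file10.inp but file007.inp would become file8.inp. That is fine for names like file0.inp, but it is an assumption worth knowing about.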

A Process to Automate

To summarize: I use GAMESS to perform relaxed scans. To perform a relaxed scan I have to copy text by hand from multiple files into each input file and submit it to GAMESS for processing. As it stands, the process for performing a relaxed scan on a prepared input file is listed below:

1. Take Start0.inp and add the DET group
2. Execute "rungms Start0.inp > Start0.out"
3. Copy Start0.inp to Start1.inp
4. Copy the equilibrium geometry from Start0.out to Start1.inp
5. Copy the molecular orbitals from Start0.dat to Start1.inp
6. Add the freeze command to Start1.inp
7. Increment the frozen coordinate in Start1.inp
8. Execute "rungms Start1.inp > Start1.out"
9. Copy Start1.inp to Start2.inp
10. Copy the equilibrium geometry from Start1.out to Start2.inp
11. Copy the molecular orbitals from Start1.dat to Start2.inp
12. Increment the frozen coordinate in Start2.inp
13. Execute "rungms Start2.inp > Start2.out"
14. Repeat steps 9-13 approximately 20-50 times


The only step that requires a decent amount of chemistry knowledge is step one. Everything else is text file manipulation and processing. So my goal is to reduce the process to the following:

1. Execute "python rscan.py Start0.inp"
2. Specify the coordinate, number of steps, and step size
3. Sit back and enjoy the day

The 12 steps eliminated are all file processing. In order to eliminate them I am going to automate the process one step at a time. I will also be using this as an opportunity to really try Python for the first time. I will start with step four. Step four will teach me Python file manipulation and I/O. These skills will carry over to some of the other, more complicated steps.
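To get a feel for the shape of the script, here is a rough sketch that only writes out the plan for a scan as a list of plain-English actions. It does not touch any files or call GAMESS. The function name plan_scan is hypothetical, and the actions are just the manual steps from the list above with the file numbers filled in.

```python
import re


def plan_scan(first_input, n_steps):
    """List the actions for a relaxed scan that starts from first_input.

    first_input is a name like "Start0.inp"; n_steps is how many times the
    frozen coordinate is incremented. Nothing is executed here.
    """
    match = re.fullmatch(r"(.*?)(\d+)\.inp", first_input)
    if match is None:
        raise ValueError(f"expected a name like Start0.inp, got {first_input!r}")
    prefix, start = match.group(1), int(match.group(2))

    # The prepared input file is run first.
    plan = [f"rungms {prefix}{start}.inp > {prefix}{start}.out"]
    for i in range(start + 1, start + n_steps + 1):
        prev, cur = f"{prefix}{i - 1}", f"{prefix}{i}"
        plan.append(f"copy {prev}.inp to {cur}.inp")
        plan.append(f"copy equilibrium geometry from {prev}.out to {cur}.inp")
        plan.append(f"copy molecular orbitals from {prev}.dat to {cur}.inp")
        if i == start + 1:
            # The freeze command is only added once, to the first copy.
            plan.append(f"add freeze command to {cur}.inp")
        plan.append(f"increment frozen coordinate in {cur}.inp")
        plan.append(f"rungms {cur}.inp > {cur}.out")
    return plan


for action in plan_scan("Start0.inp", 2):
    print(action)
```

Each line of the plan would eventually become a small Python function, which is why automating one step at a time seems like a reasonable way in.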

First Some Chemistry

The other summer I did undergraduate research in computational chemistry. Doing research is part of my undergraduate studies in Chemistry. But because I am also a Computer Science major, I see the potential of computers to automate portions of the research process. Before I can explain how the process can be automated, I need to give some chemistry background on relaxed scans. Put simply, a relaxed scan calculates the energy of a molecule as one of its coordinates is adjusted. For instance, consider methane, a carbon surrounded by four hydrogens in a tetrahedral shape. A relaxed scan can calculate the energy of the molecule as a hydrogen atom is pulled away. The first step is finding the lowest-energy structure of the molecule. Next, the bond length between the carbon and the hydrogen is slightly increased and then frozen. With the bond frozen, the lowest-energy structure is recalculated. The bond is stretched and frozen again, and again the lowest-energy structure is calculated. The result is that after every stretch the molecule is allowed to relax to a new geometry. The cycle of stretch and relax is repeated 20 to 50 times. The stretch and relax process is accomplished using software, but each input file is created by copying text from previous output files. This is where I see the potential for some automation.
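In code, the stretch and relax cycle is just a loop. The sketch below is only an outline of that loop, not real chemistry: the relax argument stands in for whatever runs the geometry optimization with the coordinate frozen, and the names scan_values and relaxed_scan, the step size, and the toy energy function are all assumptions made for illustration.

```python
def scan_values(start, step, n_steps):
    """Return the frozen coordinate value used for each stretch of the scan."""
    return [round(start + i * step, 7) for i in range(1, n_steps + 1)]


def relaxed_scan(start, step, n_steps, relax):
    """Stretch, freeze, and relax n_steps times.

    relax is a hypothetical callable that takes the frozen coordinate value
    and returns the energy of the relaxed structure.
    """
    results = []
    for value in scan_values(start, step, n_steps):
        energy = relax(value)  # everything except the frozen coordinate relaxes
        results.append((value, energy))
    return results


def toy_energy(bond_length):
    """Toy stand-in for the real calculation: a parabola with its minimum at 1.0."""
    return (bond_length - 1.0) ** 2


for value, energy in relaxed_scan(1.0, 0.5, 3, toy_energy):
    print(value, energy)
```

In the real workflow each call to relax is a full run of the software on a freshly prepared input file, which is exactly the part that involves all the copying by hand.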