Wednesday, August 10, 2005

awk stuff

Hey, I learned something new today

Here's a file I have; this is just a small sample of it:

*** Week 1 @@@ sfm_146 smm_146 wfm_0 wmm_0
1989,3479 m1 s121 e141
2449,3503 m1 s121 e141
2555,3539 m1 s121 e141
2652,2849 m1 s121 e141
2349,2229 m1 s121 e141
2617,3004 m1 s121 e141
2713,3097 m1 s121 e141
3356,2684 m1 s121 e141
2725,2193 m1 s121 e141
2016,1992 m1 s121 e141
2813,2569 m1 s121 e141
3000,2796 m1 s121 e141
2851,2628 m1 s121 e141
2320,3693 m1 s121 e141
2270,3087 m1 s121 e141
2538,2173 m1 s121 e141
2647,2702 m1 s121 e141
2669,2445 m1 s121 e141
2217,2498 m1 s121 e141
2565,3155 m1 s121 e141
3212,2145 m1 s122 e142
2338,1808 m1 s122 e142
2009,2752 m1 s122 e142
4142,2074 m1 s122 e142
2605,2691 m1 s122 e142
2756,2689 m1 s122 e142
2337,2698 m1 s121 e142
2959,2286 m1 s121 e142
2554,3128 m1 s122 e142
2761,2578 m1 s121 e142
2894,2183 m1 s122 e142
2789,2265 m1 s122 e142
2315,2834 m1 s122 e142
2846,2133 m1 s121 e142
3218,2455 m1 s122 e142
2023,2796 m1 s121 e142
2683,3119 m1 s122 e142
2570,3545 m1 s122 e142

That is the output file from my program regarding moth matings and their locations. The first pair is the (x,y) location of the mating. The second field, beginning with 'm', is the mate type: 1 for sterile:sterile, 2 for sterile:wild, and 3 for wild:wild matings. The third field, which starts with 's', is the start time of the mating session, and the 'e' field is the time they finished mating. My moths are pretty in sync with each other; maybe that's something I should fix later.
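To make that layout concrete, here is a minimal sketch of how awk could pull one record apart. The sample line is copied from the file above; the labels in the output (x, y, mate, start, end) are made-up names for illustration, not anything the program itself writes.

```shell
# One record from the file above; the output labels are hypothetical names.
line='1989,3479 m1 s121 e141'

# split() breaks the "x,y" pair on the comma; substr(field, 2) drops the
# leading letter from the m/s/e fields.
parsed=$(printf '%s\n' "$line" | awk '{
    split($1, xy, ",")
    print "x=" xy[1], "y=" xy[2], "mate=" substr($2, 2), "start=" substr($3, 2), "end=" substr($4, 2)
}')
echo "$parsed"
```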

What I needed from that file was all the mate start times. I had to find an easy way to parse that data, but how to do it? A friend of mine suggested a quick 'awk' command on it that went something like this:

cat _MatingLocations.txt | awk '{print $3}' > tmp.txt

which just takes the 'third word' from each line of the file and spits it out; in this case I redirected the output to the tmp.txt file. The tmp.txt file looks like this now:

1
s121
s121
s121
s121
s121
s121
s121
s121
s121
s121
s121
s121
s121
s121
s121
s121
s121
s121
s121
s121
s122
s122
s122
s122
s122
s122
s121
s121
s122
s121

And as you can see, it's got a '1' printed at the very top. You might have noticed that it is the third word from the first line, '*** Week 1 @@@ sfm_146 smm_146 wfm_0 wmm_0', and is something I don't want. So next I grep'd the file to output only the lines with an 's' in them.

grep s tmp.txt > filter.txt

which looks the same as above, just without the 1 at the top. Now I want to get rid of the 's' in front of the values so I can run those values through another program to get the mean value. To remove the 's' I simply grab the substring starting at the second character, like this

cat filter.txt | awk '{print substr($0,2)}' > inp.dat

which outputs data like this

121
121
121
121
121
...

and can now be read by other programs as numerical values. awk made it a whole lot easier to parse the files than writing a custom script in, say, matlab to do it for me.
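The three steps above (awk, grep, awk) could probably be folded into a single awk call. This is only a sketch: sample.txt is a made-up file name holding a few lines in the same layout, and the pattern assumes every data line has a third field starting with 's'.

```shell
# A few lines in the same layout as the file above (sample.txt is a made-up name).
cat > sample.txt <<'EOF'
*** Week 1 @@@ sfm_146 smm_146 wfm_0 wmm_0
1989,3479 m1 s121 e141
3212,2145 m1 s122 e142
EOF

# Keep only lines whose third field starts with 's' (this skips the header,
# whose third field is '1'), then print that field without its first character.
awk '$3 ~ /^s/ { print substr($3, 2) }' sample.txt > inp.dat
cat inp.dat
```

Matching on the field itself, rather than grepping for any 's' on the line, also avoids picking up lines that merely contain an 's' somewhere else.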

And speaking of matlab, I could then open matlab in a terminal with 'matlab -nodesktop', load the data with 'load inp.dat', calculate the average of the values with 'mean(inp)', and then save the average value with 'save -ascii avgData.dat'.
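If all that is needed is the average, awk can likely compute it directly instead of matlab. A minimal sketch, assuming one number per line as in inp.dat; times.dat is a made-up file name used here so the real inp.dat is not overwritten.

```shell
# Four start times, one per line, in the same form as inp.dat
# (times.dat is a made-up name).
printf '121\n121\n122\n122\n' > times.dat

# Accumulate a running sum and a count, then print the mean at the end.
# The n > 0 check avoids dividing by zero on an empty file.
mean=$(awk '{ sum += $1; n++ } END { if (n > 0) print sum / n }' times.dat)
echo "$mean"
```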

And to grab a few words from each line of the raw text file, I could use

cat file.txt | awk '{print $2 " " $3 " " $4}'
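A slightly shorter way to write that print, sketched on a single sample line from the file above: separating the fields with commas makes awk join them with its output field separator, which is a single space by default.

```shell
# One record from the mating file, used as sample input.
line='2449,3503 m1 s121 e141'

# Commas in print insert the output field separator (a space by default)
# between the fields, so no explicit " " strings are needed.
words=$(printf '%s\n' "$line" | awk '{ print $2, $3, $4 }')
echo "$words"
```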

-Thanks awk
