The awk language is useful for whittling down data from a mega-file to a more manageable file. Instead of keeping every column (say, columns A-Z, each holding different information), you can select only the columns that matter to you. awk can also select rows whose values fall above or below a chosen cutoff, which can make data analysis faster. The excerpt shown below is from a complete file with 24,525 rows of data.

 CHR             SNP   A1   A2          MAF  NCHROBS
  21      kgp2850918    C    A      0.03691      298
  21      kgp4753447    A    G      0.03716      296
  21      kgp6829524    A    G      0.06419      296
  21     kgp13210339    A    C      0.05667      300
  21     kgp10927414    A    G      0.06419      296
  21     kgp10658468    A    G      0.06667      300
  21      rs10439884    A    G         0.08      300

awk can select rows by comparing their MAF values against a specific cutoff. In the awk line below, the header is printed by the pattern NR == 1 (NR is the number of the record, or line, currently being read; NF is the number of fields, or columns, in that record), and the remaining rows are filtered on column 5 ($5) to keep only those with values below 0.05. The output is then redirected with > into a new file with the specified name.

$ awk 'NR == 1; NR > 1 && $5 < 0.05' plink.frq > plinkawk.frq

The opening lines of the new file, which now has 350 rows, are:

 CHR             SNP   A1   A2          MAF  NCHROBS
  21      kgp2850918    C    A      0.03691      298
  21      kgp4753447    A    G      0.03716      296
  21      kgp5439554    A    G      0.04667      300
  21      kgp9921880    G    A      0.04333      300
  21     kgp13121553    G    A         0.03      300
  21      kgp1799905    A    G      0.04667      300
  21      kgp4273039    A    C       0.0473      296
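
The same kind of filter can combine a lower and an upper cutoff. The lines below are a minimal sketch of that idea, not part of the original analysis: the file names sample.frq and sample_range.frq are made up for illustration, and the sample rows are copied from the excerpt at the top so the example runs on its own.

```shell
# Build a small sample file with the same layout as the excerpt above.
# (sample.frq is a hypothetical name used only for this sketch.)
cat > sample.frq <<'EOF'
 CHR             SNP   A1   A2          MAF  NCHROBS
  21      kgp2850918    C    A      0.03691      298
  21      kgp4753447    A    G      0.03716      296
  21      kgp6829524    A    G      0.06419      296
  21     kgp13210339    A    C      0.05667      300
  21      rs10439884    A    G         0.08      300
EOF

# Keep the header, plus rows whose MAF is at least 0.05 and below 0.07.
awk 'NR == 1 || ($5 >= 0.05 && $5 < 0.07)' sample.frq > sample_range.frq

# Count the data rows that passed (total lines minus the header line).
awk 'END { print NR - 1 }' sample_range.frq   # prints 2
```

Counting rows this way is a quick check that a filter kept roughly as many rows as expected before moving on with the smaller file.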

If, for example, you wanted to print only four specific columns to the Terminal (rather than to a separate file), a command like

$ awk  '{print $1, $2, $3, $9}' mega_data_set.dat

could be used.
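
As a rough sketch of column selection on a file laid out like the frequency file shown earlier, the lines below print the SNP name, the MAF, and the last column. The file names sample_cols.frq and sample_cols.tsv are hypothetical, and writing tab-separated output through OFS is an illustrative choice, not something the commands above require. $NF refers to the last field of each row, whatever its position.

```shell
# Build a tiny sample file (hypothetical name) with the same columns
# as the frequency file shown earlier.
cat > sample_cols.frq <<'EOF'
 CHR             SNP   A1   A2          MAF  NCHROBS
  21      kgp2850918    C    A      0.03691      298
  21      rs10439884    A    G         0.08      300
EOF

# Print column 2 (SNP), column 5 (MAF), and the last column ($NF),
# separated by tabs. OFS is awk's output field separator.
awk -v OFS='\t' '{ print $2, $5, $NF }' sample_cols.frq > sample_cols.tsv

# Show the result in the Terminal.
cat sample_cols.tsv
```

Dropping the redirection (> sample_cols.tsv) sends the selected columns straight to the Terminal instead of a file.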