BLAT News
Welcome to the first in an occasional series of newsletters about BLAT, Jim Kent's genome alignment software. In this issue:
- BLAT Upgrade Blasts BLAST
- Pretty Day: psl File Format Utilities
- Parasol: Ease and Comfort under the Sun (or any other cluster)
- Jim Kent: Upcoming Speaking Engagements and Conference Schedule
Note: All software mentioned in this newsletter is copyrighted. BLAT requires a license for commercial users.
BLAT Upgrade Blasts BLAST
If you've been wanting to switch from BLAST to BLAT but dreaded reprogramming your IT structure to deal with BLAT's different output format, here's some great news.
Jim has added options to the -out switch to let you generate output in BLAST, wuBLAST, BLASTz-associated axt, or MULTIz-associated maf formats. The switch works for both BLAT and gfClient. The syntax is as follows:
-out=type Controls output file format. Type is one of:
psl - Default. Tab separated format without actual sequence
pslx - Tab separated format with sequence
axt - blastz-associated axt format
maf - multiz-associated maf format
wublast - similar to wublast format
blast - similar to NCBI blast format
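As a rough illustration of how the -out switch might be wired into an existing pipeline, here is a minimal Python sketch that builds a blat command line for a chosen output type. The file names (genome.fa, rna.fa, hits.blast) are hypothetical, and the argument order (database, query, option, output file) is an assumption; check the usage message of your own blat build before relying on it.

```python
import shlex

# Output types taken from the -out help text above.
OUT_TYPES = ("psl", "pslx", "axt", "maf", "wublast", "blast")


def blat_command(database, query, output, out_type="psl"):
    """Build an argument list for a blat run with the chosen -out type.

    Assumed argument order: database, query, -out option, output file.
    """
    if out_type not in OUT_TYPES:
        raise ValueError(f"unknown -out type: {out_type!r}")
    return ["blat", database, query, f"-out={out_type}", output]


# Hypothetical file names, for illustration only.
cmd = blat_command("genome.fa", "rna.fa", "hits.blast", out_type="blast")
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) on a machine where blat is installed. Validating the type up front means a typo fails before a long alignment job starts.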
Pretty Day: psl File Format Utilities
If you are using BLAT's native output format, psl, you might have noticed that it is more machine-readable than human-readable. A new utility, pslPretty, lets you convert psl to a format that is easier on the eye.
Here's a sample piece of BLAT output in psl format:
psLayout version 3
match  mis-   rep.   N's  Q gap  Q gap  T gap  T gap  strand  Q     Q     Q      Q    T         T     T      T     block  blockSizes     qStarts        tStarts
       match  match       count  bases  count  bases          name  size  start  end  name      size  start  end   count
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
420    0      0      0    0      0      3      786    +       RNA1  420   0      420  GENOMIC1  7854  3436   4642  4      70,96,222,32,  0,70,166,388,  3436,3673,4264,4610,
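To make the psl columns a little less opaque, the following Python sketch parses the sample record above and derives the gaps between aligned blocks on the target. The class and function names are made up for this illustration; only the column order comes from the psl header.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Psl:
    """A subset of the 21 psl columns (names chosen for this sketch)."""
    matches: int
    strand: str
    q_name: str
    q_start: int
    q_end: int
    t_name: str
    t_start: int
    t_end: int
    block_sizes: List[int]
    q_starts: List[int]
    t_starts: List[int]


def _int_list(field):
    # psl lists carry a trailing comma, e.g. "70,96,222,32,"
    return [int(x) for x in field.split(",") if x]


def parse_psl_line(line):
    f = line.split()
    if len(f) != 21:
        raise ValueError(f"expected 21 psl fields, got {len(f)}")
    return Psl(
        matches=int(f[0]), strand=f[8],
        q_name=f[9], q_start=int(f[11]), q_end=int(f[12]),
        t_name=f[13], t_start=int(f[15]), t_end=int(f[16]),
        block_sizes=_int_list(f[18]),
        q_starts=_int_list(f[19]),
        t_starts=_int_list(f[20]),
    )


def target_gaps(psl):
    """Unaligned target bases between each pair of consecutive blocks."""
    ends = [s + n for s, n in zip(psl.t_starts, psl.block_sizes)]
    return [nxt - end for end, nxt in zip(ends, psl.t_starts[1:])]


SAMPLE = ("420 0 0 0 0 0 3 786 + RNA1 420 0 420 GENOMIC1 7854 3436 4642 4 "
          "70,96,222,32, 0,70,166,388, 3436,3673,4264,4610,")

record = parse_psl_line(SAMPLE)
print(target_gaps(record))       # [167, 495, 124]
print(sum(target_gaps(record)))  # 786, the "T gap bases" column
```

The three gap lengths are the same inserts that pslPretty abbreviates in its output.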
Now here's the same piece of output after running through the pslPretty formatter:
>RNA1:0+420 GENOMIC1:3436+4642
agtggacaaccctggccaccccttcatcaagactgtgggcatggtggctggagatgagga
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
agtggacaaccctggccaccccttcatcaagactgtgggcatggtggctggagatgagga
gacctatgag------167------gtatttgctgaactgtttgaccctgtgatccaaga
||||||||||               |||||||||||||||||||||||||||||||||||
gacctatgaggtaggg...tttcaggtatttgctgaactgtttgaccctgtgatccaaga
gcggcataatggatatgaccccagaacaatgaagcacaccactgaccttgatgccagtaa
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gcggcataatggatatgaccccagaacaatgaagcacaccactgaccttgatgccagtaa
a------495------attcgttctggctactttgatgagaggtatgtattgtcttcaag
|               ||||||||||||||||||||||||||||||||||||||||||||
agtgagc...cctcagattcgttctggctactttgatgagaggtatgtattgtcttcaag
agtcagaactggccgaagtatcaggggactcagtctccctccagcctgcactcgggcaga
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
agtcagaactggccgaagtatcaggggactcagtctccctccagcctgcactcgggcaga
gcgaagagaggtagaacgtgttgtggtggatgctctgagtggcctgaagggtgacctggc
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gcgaagagaggtagaacgtgttgtggtggatgctctgagtggcctgaagggtgacctggc
tggacggtactataggctcagtgagatgacggaggccgaacagcagcagcttattgat--
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
tggacggtactataggctcagtgagatgacggaggccgaacagcagcagcttattgatgt
----124------gaccattttctgtttgataaacctgtgtcccc
             ||||||||||||||||||||||||||||||||
gagg...cctcaggaccattttctgtttgataaacctgtgtcccc
OK, pretty may be a relative thing. This output looks best in a constant-width font, of course.
Here is the usage information for pslPretty:
pslPretty - Convert PSL to human readable output
usage:
pslPretty in.psl target.lst query.lst pretty.out
options:
-axt - save in Scott Schwartz's axt format
-dot=N Put out a dot every N records
-long - Don't abbreviate long inserts
If the psl file contains multiple targets, it is a really good idea to sort it by target first; otherwise pslPretty will be very slow. The target and query lists can each be a fasta file, a nib file, or a list of fasta and/or nib files, one per line. Currently pslPretty handles only nucleotide-based psl files.
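For small files, ordering the records by target before running pslPretty can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name is hypothetical, records are taken to be tab-separated lines with at least 21 fields whose first field is a match count, and everything else is treated as header.

```python
def sort_psl_by_target(lines):
    """Return psl lines with data records ordered by target name, then target start.

    Header lines (anything that is not a data record) are kept first,
    in their original order.
    """
    header, records = [], []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Assumption: a data record has >= 21 tab-separated fields
        # and begins with the numeric match count.
        if len(fields) >= 21 and fields[0].isdigit():
            # Column 14 is the target name, column 16 the target start.
            records.append((fields[13], int(fields[15]), line))
        else:
            header.append(line)
    records.sort(key=lambda r: (r[0], r[1]))
    return header + [r[2] for r in records]


def fake_record(t_name, t_start):
    """Build a made-up one-block record, purely for demonstration."""
    f = ["10", "0", "0", "0", "0", "0", "0", "0", "+", "q1", "10", "0", "10",
         t_name, "1000", str(t_start), str(t_start + 10), "1",
         "10,", "0,", f"{t_start},"]
    return "\t".join(f) + "\n"


lines = ["psLayout version 3\n",
         fake_record("chrB", 50), fake_record("chrA", 300), fake_record("chrA", 20)]
for line in sort_psl_by_target(lines):
    print(line.split("\t")[13:16] if "\t" in line else line.strip())
```

This holds every record in memory, so it only suits modest inputs; whole-genome runs need a disk-based sort.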
In addition to pslPretty, BLAT users now have a concatenation tool called pslCat. Here's the usage information:
pslCat - concatenate psl files
usage:
pslCat file(s)
options:
-check parses input. Detects more errors but slower
-nohead omit psl header
-dir files are directories (concatenate all in dirs)
-out=file put output to file rather than stdout
-ext=.xxx limit files in directories to those with extension
Both pslPretty and pslCat are considered upgrades to BLAT, and as such they are automatically included in all current licenses!
In addition to pslPretty and pslCat, Jim has written three other utilities for working with psl files: pslSort, pslSortAcc, and pslReps. These copyrighted programs are available free of charge for your convenience. However, they are not included in the BLAT license and are therefore not formally supported. Following is the usage information for these handy utilities:
pslSort - merge and sort psCluster .psl output files
usage:
pslSort dirs[1|2] outFile tempDir inDir(s)
This will sort all of the .psl files in the directories
inDirs in two stages - first into temporary files in
tempDir and second into outFile. The device on tempDir
needs to have enough space (typically 15-20 gigabytes
if processing whole genome)
pslSort g2g[1|2] outFile tempDir inDir(s)
This will sort a genome to genome alignment, reflecting
the alignments across the diagonal. Adding 1 or 2 after
the dirs or g2g will limit the program to only the first
or second pass respectively of the sort.
pslSortAcc - sort pslSort .psl output file by accession.
Make one output .psl file per accession.
usage:
pslSortAcc how outDir tempDir inFile(s)
This will sort the inFiles by accession in two steps.
Intermediate results will be put in tempDir. The final
result (one .psl file per target) will be put in outDir.
Both outDir and tempDir will be created if they do not
already exist. The 'how' parameter should be either
'head' or 'nohead'.
pslReps - analyse repeats and generate genome wide best
alignments from a sorted set of local alignments
usage:
pslReps in.psl out.psl out.psr
where in.psl is an alignment file generated by psLayout
and sorted by pslSort, out.psl is the best alignment output
and out.psr contains repeat info
options:
-ignoreSize Will not weigh in favor of larger alignments so much
-singleHit Takes single best hit, not splitting into parts
-minCover=0.N minimum coverage to output. Default is 0.
-minAli=0.N minimum alignment ratio
default is 0.93
-nearTop=0.N how much can deviate from top and be taken
default is 0.01
-minNearTopSize=N Minimum size of alignment that is near top
for alignment to be kept. Default 20.
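The two-stage approach described in the pslSort usage text (sorted temporary files first, then a merge into the output file) is a classic external sort. The Python sketch below illustrates that general technique only; it is not Jim's implementation. The function names, the chunk size, and the sort key (query name, then query start) are all assumptions made for the example.

```python
import heapq
import os
import tempfile


def psl_key(line):
    # Hypothetical sort key: query name (column 10), then query start (column 12).
    f = line.split("\t")
    return (f[9], int(f[11]))


def is_record(line):
    # Assumption: data records have >= 21 tab-separated fields and a numeric first field.
    f = line.split("\t")
    return len(f) >= 21 and f[0].isdigit()


def two_stage_sort(in_paths, out_path, temp_dir, chunk_size=100_000):
    """Stage 1: write sorted chunks to temp_dir. Stage 2: merge them into out_path."""
    os.makedirs(temp_dir, exist_ok=True)
    temp_paths, chunk = [], []

    def flush():
        if not chunk:
            return
        chunk.sort(key=psl_key)
        fd, path = tempfile.mkstemp(suffix=".psl", dir=temp_dir)
        with os.fdopen(fd, "w") as tmp:
            tmp.writelines(chunk)
        temp_paths.append(path)
        chunk.clear()

    # Stage 1: read every input, spilling a sorted chunk whenever memory fills.
    for path in in_paths:
        with open(path) as src:
            for line in src:
                if is_record(line):
                    chunk.append(line.rstrip("\n") + "\n")
                    if len(chunk) >= chunk_size:
                        flush()
    flush()

    # Stage 2: k-way merge of the sorted temporary files.
    files = [open(p) for p in temp_paths]
    try:
        with open(out_path, "w") as out:
            out.writelines(heapq.merge(*files, key=psl_key))
    finally:
        for f in files:
            f.close()
        for p in temp_paths:
            os.remove(p)
```

Only one chunk is held in memory at a time, which is why the temporary directory, rather than RAM, has to be large enough for the whole data set.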
Parasol: Ease and Comfort under the Sun (or any other cluster)
Parasol is the cluster management system for the University of California Santa Cruz kilocluster, which runs the UCSC Genome Browser and most of the University's bioinformatics jobs, such as the human/mouse alignment project. Jim Kent wrote Parasol when he couldn't find a commercial program robust enough (and cost-effective enough) to support the needs of the cluster.
Parasol is available free of charge to any user. For information, go to the Parasol documentation web site.
Jim Kent: Upcoming Speaking Engagements and Conference Schedule
If you're hoping to run into Jim to chat about BLAT and other hot topics in bioinformatics, here is his upcoming speaking and conference schedule:
- May 28 - June 2, 2003, 68th Symposium on Quantitative Biology: The Genome of Homo sapiens, Cold Spring Harbor, NY.
- June 29-July 3, 2003, ISCB Conference on Intelligent Systems for Molecular Biology, Brisbane, Australia. Jim is this year’s Overton Prize winner and will be presenting a keynote speech.