Aligning and Classifying Sequences

Aligning and Classifying SequencesWorkspace set upFastQ filesTaxonomy FilesDownload mothurMothur workflow overviewMothur workflow detailsBatch scriptsHome

The mothur MiSeq SOP is an excellent resource from the mothur wikipedia site. It's the best place to start if you have MiSeq sequences from the v4 region. The mothur pipeline is suboptimal for use with any other region, as Pat Schloss has made clear many times. However, for a number of reasons, other sequencing products are common and it is possible to adapt mothur to align and identify sequences from those regions. The commands in this tutorial are relatively less annotated than the mothur SOP, so if you are new to using mothur, you should still start there before trying to use this document.

The coordinates in this tutorial are for the V3V4 region, paired end 300bp. Use the following table to look up the correct coordinates for your region. This table is by no means comprehensive, but is a nice shortcut if you happen to be working on a region that I have already worked on!

Region	SILVA 138, for pcr.seqs	screen.seqs	pcr.seqs truncate	screen.seqs truncate
V4 start	13859	1968	—	—
stop	23447	11550	—	—
V3V4 start	11895	2	1967	8
stop	25318	13423	11549	9582
ZYMO V3V4 start	6424	64
stop	22588	62645
V4V5 start	11895
stop	28464
V5V6 start	23440		—	—
stop	34124		—	—

Workspace set up

Work out of a scratch directory, not your home folder. You will create many large intermediate files that will eat away at our shared storage space allocation while using mothur, but if working from global scratch space, they won't count towards the storage limit. It will also be easier to work within mothur if all the files you are working with are in the same location. Don't worry about files getting deleted -- that's why you're making copies of the fastq files, not moving them.

You can use a program like FileZilla or WinFCP to move files around on the MSI servers within a comfortable graphic user interface, but ultimately you must use the unix bash shell to run mothur, which does not have a graphic user interface. Start getting comfortable typing commands directly into the terminal by learning some basic navigation and file management commands. If working on the shell feels really uncomfortable for you, consider doing a short crash course or two. It can't be emphasized enough that you have to get comfortable without the GUI.

First, make a working directory in the global scratch space. Folders in the global scratch space begin wtih /scratch.global/. The leading slash is really important. I always begin the name my working directories with my name since it makes it easier to locate them later. To make your life easier and to type less, store that folder name as a variable.


xxxxxxxxxx
mkdir /scratch.global/yourname_project
export WDIR=/scratch.global/yourname_project

FastQ files

Now, navigate to the directory where your .fastq files are stored. They are probably compressed, and end with the extension .gz. In our lab, they are stored in the following folder, within some more nested subfolders.


xxxxxxxxxx
cd /panfs/roc/groups/2/chun0157/data_release/anotherfolder/anotherfolder

Copy all of the files that end in .gz to your working directory. This assumes you want to align all of your sequences.


xxxxxxxxxx
cp *.gz $WDIR

Now navigate to that folder and verify the files you want are there.


xxxxxxxxxx
cd $WDIR
ls -lt

The options -lt sort your files by time created.

Taxonomy Files

Our lab is currently using version 1.38 of the non redundant SILVA reference database for both alignment and taxonomic classification. You could also use greengenes or the RDP database — all are available on the mothur website. You can download those files directly to your scratch workspace folder on MSI and unpack using these commands at the bash terminal:


xxxxxxxxxx
wget https://mothur.s3.us-east-2.amazonaws.com/wiki/silva.nr_v138.tgz
tar -xvf silva.nr_v138.tgz

If you have access to the chun lab shared folder, you can skip some steps later if you copy over the truncated versions of the alignment file as well. There is a V3V4 file and a V4 file.

Download mothur

Becuase mothur is continually updated to remove bugs, etc., I think it's best to always start with a fresh download in your working directory. If you notice that you are having problems getting a command to work that you know worked before, you can first check for changes to the command on the mothur wiki site. If that doesn't work, you can always download and use an older version of mothur from their website.


xxxxxxxxxx
wget https://github.com/mothur/mothur/releases/download/v1.44.3/Mothur.linux.zip
unzip Mothur.linux.zip

Clean up a bit before continuing. Mothur can be confused if there are other compressed files in the working directory.


rm *.tgz
rm *.zip

Mothur workflow overview

This is less of a pipeline and more of a plumbing network. This workflow has many checkpoints where you may need to return to earlier analysis steps and change options. Because it's more complicated than a pipeline, I've made an overview here to help you navigate.

Mothur workflow details

You can enter these commands interactively or create a batch script to run via SLURM. It's best to do a trial run interactively with only a few samples to work the kinks out before creating a batch script.

To run mothur interactively, type the following into the bash terminal. See the end of this document for help putting together batchs cripts.


mothur/./mothur

The mothur prompt looks a little different than the bash shell, which normally displays a node number and your current directory before the prompt (e.g. galeyxxx@cn105 [/home/chun0157/shared] %). It should now look more like: mothur >

From the mothur prompt, you can enter these commands. Punctuation is incredibly important.


xxxxxxxxxx
set.dir(output=., input=.)

This sets your input and output directory to the current folder. If you didn't take my file organization advice, you'll need to type the paths of where your fastq files are stored and where you would like the mothur output files to be stored.


x
make.file(inputdir=., type=gz, prefix=example)

This command creates a tabular text file named whatever you entered for the argument prefix ending in extension .files. The default type is fastq, but your files are probably compressed, so you need to change it to gz. Examine this document. It should have three columns — one with a unique sample name, one with forward reads, and another with matching reverse reads. If you have followed previous advice on naming your sequencing submissions, mothur will make this for you with no problems. However, sometimes the automatic names mothur generates are incorrect or broken. It's possible that mothur will create non-unique sample names and lump all your samples into one group. Ideally, these names should be the same sample names you've used in your other metadata. Check by typing this command:


system(less example.files)

system is a mothur command that directs any input you make to the bash shell. It's useful for moving and copying files without quitting mothur. less is a bash command that previews the contents of a file and prints them to the screen.

If your file is not set up correctly, fix it either in excel, textedit, notepad, or if the fixes are simple, with something like nano. Read more about how to use nano here. Note that you don't need to install nano, it's a default feature. You will need to quit mothur first to be able to use it.

Your .files look good? Proceed.


x
make.contigs(file=example.files, trimoverlap=T)

make.contigs assembles your forward and reverse reads together. There are many many options for this command — look them up on the mothur wiki. The most important here is trimoverlap. For fully overlapping reads, we want to trim to only the overlapping region. If your sequencing target is longer than the read length you sequenced (e.g. a 400 bp region sequenced with 300bp chemistry), you need to change to trim.overlap=F.


summary.seqs(fasta=current)

This convenient command is entered many times here. It summarizes the length and alignment of your sequences. Here we enter it to verify that make.contigs assembled our reads well. The majority of your sequences should have similar lengths near the length of your targeted region. For fully overlapping V3V4, that's around 295 bp.

We'll now drop sequences that are either too short or too long, had ambiguous base calls, or too many homopolymers. Run summary.seqs to check the results.


xxxxxxxxxx
screen.seqs(fasta=current,group=current,maxambig=0, maxlength=295, minlength=200, maxhomop=8)
summary.seqs(fasta=current)

Now we'll count the number of unique sequences and create a smaller fasta file to reduce processing time.


unique.seqs(fasta=current)
count.seqs(name=current, group=current)
summary.seqs(count=current, fasta=current)

You should now be able to see how many of your sequences were unique.

mothur uses Needleman-Wunsch pairwise alignment to create a large alignment file of all of your sequences. This overall makes downstream processing much faster and easier. However, Needleman-Wunsch is a global alignment algorithm, meaning that it works best when mathching up two DNA sequences that are of the same length. For best results, trim the reference database to the same region that you have amplified.

Reference regions for V3V4 and V4 can be found in the chun lab shared folder on MSI. If you want to create your own, use the table at the top of the page (column SILVA 138, for pcr.seqs) and pick coordinates appropriate for your region. Run this command only if you are creating a reference file.


pcr.seqs(fasta=silva.nr_v138.align, start=11895, end=25318, keepdots=F)

Resuming with whatever reference version you would like to use, now we'll align the sequences.


xxxxxxxxxx
align.seqs(fasta=example.trim.contigs.good.unique.fasta, reference=silva.nr_v138.pcr.v3v4.align)
summary.seqs(fasta=current, count=current)

Remove sequences that aligned badly, outside of where the majority of sequences lined up. These coordinates will differ for other regions.


xxxxxxxxxx
screen.seqs(fasta=current, count=current, summary=current, start=2, end=13423)
summary.seqs(fasta=current, count=current)

This next chunk is optional. It's a good option if you have many samples with high diversity and you've specifically sequenced the V3V4 region. The mothur pipeline struggles with a wide alignment, so removing the 50bp in V3 is one way to reduce the size of your files. This idea relies on the assumption that your alignment is good, and it's still not as high quality of an alignment as you would get if you had just sequenced V4 alone with 250bp chemistry. But it works!


pcr.seqs(fasta=current, start=1967, end=11549, keepdots=F)
summary.seqs(fasta=current, count=current)

Screen the results again to remove badly aligned sequences (there shouldn't really be many).


xxxxxxxxxx
screen.seqs(fasta=current, count=current, summary=current, start=8, end=9582)

Another optional chunk. The MiSeq SOP uses a version of the RDP classifier to assign taxonomy to sequences. The SOP does this near the end, after chimeras have been removed and similar sequences clustered together, which reduces overall processing power needed. However, I prefer to do it earlier for very large datasets to remove Unknown and Eukaryotic sequences. For datasets where I expect the amount of Archaea to be high, I also remove these. Keeping Archaea is easier, since you only have to run mothur once, but because they are so different from Bacteria sequences, they really increase the width of the alignment and can make it prohibitive to cluster OTUs. If you remove them in this step, you will have to repeat the second half of this protocol removing Bacteria instead of Archaea at this step and then combine OTUs at the end. Do this only if you have to.

In any case, the RDP classifier does not rely on having aligned sequences, but because it is based on kmers, should really be drawn from reference FASTA files from the same region of 16S. Sometime I'd like to illustrate this practically.


classify.seqs(fasta=current, count=current, reference=silva.nr_v138.pcr.v4.align, taxonomy=silva.nr_v138.tax, cutoff=80)

Note that if you did not trim to the v4 region, you should use silva.nr_v138.pc.v3v4.align instead. The cutoff value specified here is a bootstrap value (percent) below which taxonomy will not be assigned. It is not a similarity value. 80 is value that the creators of this implementation of the RDP classifier use to validate their method.

Now remove non target sequences.


xxxxxxxxxx
remove.lineage(fasta=current, count=current, taxonomy=current, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)
summary.seqs(fasta=current, count=current)

Looking at your results, screen your sequences to only the majority values for start and end. Then run filter.seqs. This will at the end print out the number of columns and width of your alignment.


xxxxxxxxxx
screen.seqs(fasta=current, count=current, start=8, end=9582)
filter.seqs(fasta=current, vertical=T, trump=.)

Is your alignment < 600bp wide? If it's not and you haven't tried either splitting off Archaea or shortening your region to ~250bp, try those before moving on. Other options include analyzing subsets of your data separately, if your hypotheses permit that.

Here we are resuming the MiSeq SOP pretty much as is. I have no real comments about it, but you should read through those on the mothur website if you haven't already. I will note that in my mothur installation, if I am not running mothur from the directory where it is installed,vsearchcan't be located and crashes the program. I always include the path to where it is installed, which is inside of the mothur directory.


unique.seqs(fasta=current, count=current)
summary.seqs(fasta=current, count=current)
pre.cluster(fasta=current, count=current, diffs=1)
summary.seqs(fasta=current, count=current)
chimera.vsearch(fasta=current, count=current, dereplicate=t, vsearch=/scratch.global/yourname_project/mothur/vsearch)
remove.seqs(fasta=current, accnos=current)
summary.seqs(fasta=current, count=current)
classify.seqs(fasta=current, count=current, reference=silva.nr_v138.pcr.v4.align, taxonomy=silva.nr_v138.tax, cutoff=80)
remove.lineage(fasta=current, count=current, taxonomy=current, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)
summary.seqs(fasta=current, count=current)
summary.tax(taxonomy=current, count=current)

If your dataset is very large or very diverse, you can run this command to drop singletons. Normally, I do not like to use this command until after I have clustered OTUs. However, you're using it after you have preclustered into sequence variants (ASVs), and the difference may be minimal. If you choose to run this command, running the summary.seqs to set your current fasta to the abundant OTUs is important, as mothur defaults to the rare OTUs. If you're not sure what to type into summary.seqs, look at the output of split.abund.


split.abund(fasta=current, count=current, cutoff=1)
summary.seqs(fasta=example.trim.contigs.good.unique.good.pcr.good.pick.filter.unique.precluster.pick.pick.filter.abund.fasta, count=current)

Now you can cluster your reads into OTUs. It is also perfectly acceptable to not cluster and use ASVs instead.

The cutoff of 0.03 is a similarity value that works for Bacteria and Archaea at the genus level.


dist.seqs(fasta=current, cutoff=0.03)

If your dataset is very large and diverse, it's a good idea to run cluster.split in two steps. This function does two things, which are optimized differently. The first step splits your alignment into many parts and can be run on many processors. However, the second step which clusters your sequences is slowed down by using many processors because of the overhead of sharing information between parallel processes. The second step is most optimized by having a large amount of RAM. In extreme cases, you can run this in two separate batch scripts wtih separate resource requests. cluster.split can also be adjusted to accounted for limited RAM by including the argument large=T, which will slow the process down but save RAM by writing temporary files to disk. Learn more about this in the mothur documentation.


cluster.split(column=current, count=current, cutoff=0.03, cluster=F, processors=24)
cluster.split(file=current, processors=12)

If your dataset is small or limited in diversity, you can get away with using cluster.split in one step.


cluster.split(column=current, count=current, cutoff=0.03, processors=24)

Now you can make your OTU table and other files for downstream analysis in R.


make.shared(list=current, count=current, label=0.03)
classify.otu(list=current, count=current, taxonomy=current, label=0.03)

These commands create files that end in .shared and .cons.taxonomy respectively. You'll need these files and your metadata to complete any of the analysis tutorials on this site.

If you are interested in trying to get more precise taxonomic matches on some of your sequences, using BLASTn or other local alignment algorithms, you will need representative fasta sequences for your otus. You can also use these representative sequecnes to create a phylogenetic tree using a tool like PhyML for example. Get those using this command:


get.oturep(column=current, list=current, count=current)

To retrieve the fasta sequence associated with all of your sequences, run this command.


bin.seqs(list=current, fasta=current)

This can be useful for resolving ambiguity about taxonomic identification for important OTUs.

And that's it! You're done!

Quit!


quit()

Now you have an OTU count table and corresponding taxonomy for all microbes in your dataset.

Batch scripts

Running mothur interactively can be time consuming — it's ultimately more convenient once you're comfortable with mothur to submit a batch script and get this all done while you're sleeping.

For the general rules of submitting a batch script to the SLURM scheduler, please see my tutorial on the subject. For mothur specifically, I would recommend creating a .sh file and a .batch file separately.

One advantage of using a .batch file is that you can supply mothur with unix formatted variable names to make your scripts more flexible. For example, instead of editing your file everytime to change the name of your working directory, you can declare a working directory variable,$WDIR, in your shell script and then use that in your .batch file.

Here's an exmaple .sh file:


x
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH --cpus-per-task=1
#SBATCH --mem=32gb
#SBATCH -t 24:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=yourx500@umn.edu
#SBATCH -p small
#SBATCH -o %j.out
#SBATCH -e %j.err
TERM=linux
export TERM
export WDIR=/scratch.global/yourname_project
export PREFIX=example
cd $WDIR
mothur/./mothur "mymothurscript.batch"

This simple shell script just supplies mothur with the name of a batch script that has all the mothur commands in it. That batch script might look like this:


x
set.dir(input=$WDIR, output=$WDIR)
make.file(inputdir=$WDIR, type=gz, prefix=$PREFIX)
make.contigs(file=current, trimoverlap=T)
quit()

The batch file simply contains all the commands you would otherwise type into the terminal. You can reference variables you defined in your shell script using the $operator.