Data Organization Throughout Your Project

Metadata

Before you do anything, before even submitting your samples to be sequenced, your metadata should be well organized. Even if the only metadata that you have is an extraction date and a plate number, that should all be organized in a spreadsheet.

Each of your samples should have a unique identifier, though not necessarily a meaningful one. That identifier can be used in your sequence submission. Avoid using overly long names. Details like the sampling site and replicate number are better kept in the metadata than in the file name.

Think of the identifier like a key. You will use it to connect your metadata to the OTU/ASV tables you make using your sequences, and possibly to other environmental data. One good option is a two- or three-letter abbreviation (I often use my initials) followed by the row number in your spreadsheet. Avoid using punctuation (_, ., ;) in the sample identifier itself. Likewise, do not include spaces or special characters in either your sample names or your column names. In sample names and column names it's OK to use . or _ as a separator, but not ()[]{},/\!@#$%^&*-+.

Example:

SID	Site	Date	SampleType
MG001	SKch1	09/05/19	Root
MG002	SKch1	06/01/20	Root
MG003	BRLch2	09/05/19	Root
MG004	BRLch2	06/01/20	Root
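
A quick check along these lines can catch identifier problems before samples are submitted. This is a minimal sketch, assuming the metadata is saved as a tab-separated file with the identifier in the first column, as in the example above. The function names and the pattern of two to three letters followed by digits are illustrative assumptions, not a lab requirement.

```shell
# Hypothetical helper: succeeds only if the identifier is 2-3 letters
# followed by digits (e.g. MG001), with no punctuation or spaces.
is_valid_sid() {
  printf '%s\n' "$1" | grep -Eq '^[A-Za-z]{2,3}[0-9]+$'
}

# Hypothetical helper: prints every identifier that appears more than once
# in the first column of a tab-separated metadata file (header skipped).
duplicate_sids() {
  tail -n +2 "$1" | cut -f1 | sort | uniq -d
}
```

For example, is_valid_sid MG001 succeeds while is_valid_sid MG_001 fails, and duplicate_sids metadata.tsv (file name assumed) prints nothing when every identifier is unique.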

Important: To make downstream processing easier in mothur and R, make the first part of your sample name the identifier, e.g. MLG124_skch1_9_20, not skch1_MLG124_9_20. Using underscores to separate pieces of your file name is fine, but make sure that you are consistent if you do this! Always name your variables systematically. If you keep that in mind as you are constructing your data, repeating analysis on subsets of your data and fixing problems will be much easier.
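
Putting the identifier first pays off because it can then be recovered from any sample name with a single pattern. Here is a minimal sketch, assuming underscores separate the pieces of the name; the function name sid_from_name is hypothetical.

```shell
# Strip everything from the first underscore onward, leaving the leading piece.
sid_from_name() {
  printf '%s\n' "${1%%_*}"
}

sid_from_name MLG124_skch1_9_20   # prints MLG124
sid_from_name skch1_MLG124_9_20   # prints skch1, which is the site, not the identifier
```

With the identifier in a different position for each name, no single rule like this would work across the whole data set.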

Sequence Submission

Again, always begin your sample name with the sample identifier.

MSI shared space

You will likely conduct all of your alignment/identification related analysis and maybe some of the downstream analysis using the Minnesota Supercomputing Institute's resources. Once you've been granted access to use our lab's portion of resources, you'll need to be mindful of the shared space limitations.

Your space

When you log into the mesabi or mangi computing cluster, you arrive in your home folder. This area does not have its own designated space limitations, but counts towards the total space limit that our lab has. If you ever want to check your personal contribution to our lab's space use, check this page. You may need to log in first. Currently, we have about 213 GB of storage space for the whole lab. For that reason, you should not think of your home directory as personal storage space, but as a loading zone for commonly used files. Your primary storage space should still be your personal computer.

Moving files to and from your personal computer

You can use a program like FileZilla or WinSCP to move files to and from MSI storage. See here for a tutorial on how to do so.

I find it more convenient and reliable to use sftp to move files to and from my computer. It's a good option if your computer is a Mac or running a Linux distribution. To use it, simply type sftp yourx500@mesabi.msi.umn.edu into a terminal.

Once logged in, you arrive in your home folder and can use basic navigation commands (cd, ls, etc.) as in a regular bash shell, in addition to two important commands. To download files to your computer, use the get command followed by the filename you wish to download. To upload files from your computer, use the put command followed by a filename. Note that you must be in the local directory on your laptop you want to download to or upload from before typing sftp.

If you're running Windows, you'll need to install PuTTY to use this strategy. You have to download it anyway to connect to MSI resources! PuTTY will install a program called PSFTP that works in a similar way to sftp. You'll need to add PSFTP to your path variable in order to use it, but afterwards logging in works the same. From the Windows command prompt, type psftp yourx500@mesabi.msi.umn.edu to log in and then use the same put and get commands to move files to and fro. For more details on how to use PSFTP, see the documentation.

Work space

mothur creates some enormous intermediate files that make working out of the shared lab space a bad idea. Fortunately, MSI has unlimited global scratch space whose contents get deleted every 30 days or so. To make a folder in the global scratch space, make a directory that begins with /scratch.global/. By default, files you place here are visible to others in the work group, but not to the public.

Example:


mkdir /scratch.global/myname_example

It will be easiest to make separate scratch folders for each of your projects. Avoid nesting folders if possible.
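
One way to keep a flat, one-folder-per-project layout is a small helper like the following. It is only a sketch: the function name, the name_project naming pattern (borrowed from the example above), and the optional third argument for the scratch root are assumptions for illustration.

```shell
# Hypothetical helper: create <root>/<name>_<project> and print its path.
# The root defaults to /scratch.global; pass a third argument to override it.
make_scratch_dir() {
  mkdir -p "${3:-/scratch.global}/${1}_${2}" &&
    printf '%s\n' "${3:-/scratch.global}/${1}_${2}"
}
```

Calling make_scratch_dir myname example would create /scratch.global/myname_example, and calling it again with a different project name gives a sibling folder rather than a nested one.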

Cold Storage

For archival storage, or if you are walking away from your project for more than 30 days at a time, use the Second Tier Storage available through MSI. Like the global scratch space, this is a nearly unlimited resource. This is the best place to store your sequences after your project has ended, and to store intermediate mothur files if those are something you want to keep.

Accessing second tier storage is a little more inconvenient than keeping things in the home directory, but it's the polite thing to do.

Use s3cmd to access cold storage. Your directories are stored in 'buckets', which you have to make, like this:


s3cmd mb s3://nameofabucket

To examine files in your storage bucket:


s3cmd ls s3://nameofabucket

To put files in your storage bucket:


s3cmd put thenameofafile.txt s3://nameofabucket

To retrieve files from a bucket:


s3cmd get s3://nameofabucket/thenameofafile.txt

If you simply want to archive the current contents of a folder, use sync:


s3cmd sync /scratch.global/myname_example s3://nameofabucket
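
Before archiving a large folder, it can help to preview what sync would transfer. The sketch below wraps the command above in a hypothetical function, archive_folder, that runs s3cmd sync with its --dry-run option unless told otherwise. The function name and its "run" argument are assumptions for illustration, and nameofabucket is the same placeholder used above.

```shell
# Hypothetical wrapper around `s3cmd sync`.
#   archive_folder <folder> <bucket>        preview only (--dry-run)
#   archive_folder <folder> <bucket> run    actually upload
archive_folder() {
  if [ ! -d "$1" ]; then
    echo "no such folder: $1" >&2
    return 1
  fi
  if [ "${3:-preview}" = "run" ]; then
    s3cmd sync "$1" "s3://$2"
  else
    s3cmd sync --dry-run "$1" "s3://$2"
  fi
}
```

For example, archive_folder /scratch.global/myname_example nameofabucket lists what would be uploaded, and adding run as a third argument performs the upload.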

For more on how to use second tier storage, see here.

You may receive many error messages when copying a large number of files to second tier storage. This has to do with your connection, and it's not really something you should worry about.

Analysis in R

RNotebooks

I prefer working out of RNotebooks to making simple RScripts. The major difference is that the results of your scripts are archived along with your code. It is also easier to make text notes immediately below your output, which makes writing easier in turn. All of the code produced on this website (with a few exceptions) was made with RNotebooks.

For general help getting set up with notebooks, see this tutorial.

mothur file structure

At the end of the mothur workflow, you will arrive at up to three files: a .shared file, a .cons.taxonomy file, and a .fasta file if you chose to get representative sequences. The .shared and .cons.taxonomy files are just large spreadsheets, hereafter called dataframes. You can manipulate these directly in any program that you like, but because they are enormous files, Microsoft Excel will probably perform poorly. Most analysis that you will do is about connecting the .shared file, which is an OTU table, to the .cons.taxonomy file, which is a taxonomy table, and to your metadata file, which has sample information. These are kept as three separate tables because of their enormous size, and because data transformations, like transforming count data to some other measure of abundance, are easiest if the tables are not combined.
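
Because the sample identifier is the key that ties these tables together, it can be worth checking that every sample in the .shared file has a metadata row before starting the analysis. The following is a minimal sketch, assuming the usual mothur .shared layout (tab-separated, with the sample name in the second column, headed Group), a tab-separated metadata file with the identifier in its first column, and sample names that begin with the identifier followed by an underscore. The function name is hypothetical.

```shell
# Hypothetical check: print identifiers that occur in the .shared file
# but have no matching row in the metadata file.
#   samples_missing_metadata <metadata.tsv> <file.shared>
samples_missing_metadata() {
  awk -F'\t' '
    FNR == 1  { next }                    # skip the header line of each file
    NR == FNR { known[$1] = 1; next }     # first file: remember metadata identifiers
    {
      sid = $2
      sub(/_.*/, "", sid)                 # keep only the leading identifier
      if (!(sid in known)) print sid
    }
  ' "$1" "$2" | sort -u
}
```

An empty result means every sample in the OTU table can be matched to its metadata. Anything printed is an identifier to track down before combining the tables.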

phyloseq

phyloseq is an extremely convenient package designed to make combining your .shared, .cons.taxonomy, and metadata file easier. It can also combine information from a phylogenetic tree, if you have one.