Process a New Run From MGI
Chase Mateusiak
08/04/2022
ProcessRun.Rmd
Abstract
After receiving a new run from MGI, align, count and QC it.
brentlabRnaSeqTools package version: 0.0.0.0
Get the metadata from the database
meta = getMetadata(
database_info$kn99$db_host,
database_info$kn99$db_name,
Sys.getenv("db_username"),
Sys.getenv("db_password")
)
Look at the metadata for the run you are processing (called run_df below) and make sure it is correct
View(run_df)
Write out the sample sheet
If you have mounted HTCF on your local computer, you can write directly to HTCF. Otherwise, write to your computer and follow the directions below to move the file to HTCF.
sample_sheet = createNovoalignPipelineSamplesheet(run_df, "/scratch/mblab/chasem/rnaseq_pipeline/scratch_sequence")
write_csv(sample_sheet, "/path/to/where/you/write_things/run_<some_identifier>.csv")
Move a file to HTCF
Log into HTCF and make a directory that will store the input/output for this run. For example, if I were processing run_1234, I would log into HTCF and use mkdir to make a run_1234 directory in my scratch space.
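A minimal sketch of that step, assuming the example layout used in the scp command in this section. The function name make_run_dir is hypothetical and is not part of any lab tooling.

```shell
# Hypothetical helper: create the input/output directory for one run.
# $1 = parent directory for run results, $2 = run number
make_run_dir() {
    local results_dir="$1"
    local run_number="$2"
    local run_dir="${results_dir}/run_${run_number}"
    # -p creates missing parent directories and does not fail
    # if the directory already exists
    mkdir -p "${run_dir}"
    # print the path so it can be reused, eg as the scp destination
    echo "${run_dir}"
}

# Example call on HTCF (replace <your_username> with your own):
# make_run_dir "/scratch/mblab/<your_username>/rnaseq_pipeline/align_count_results" 1234
```

Because of the -p flag, calling this twice for the same run is harmless.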
Back on your local computer, send the file to HTCF with scp:
# copy the file from your computer to a directory in your personal subdirectory
# of the lab scratch space
$ scp /path/to/where/you/write_things/run_<some_identifier>.csv \
<your_username>@htcf.wustl.edu:/scratch/mblab/<your_username>/rnaseq_pipeline/align_count_results/run_1234
Please note that there is no requirement that the path look like this: <your_username>/rnaseq_pipeline/align_count_results/run_1234. It is just an example of what it might look like.
On HTCF, start the pipeline
The first time you do this, navigate to your scratch space and clone the brentlab_rnaseq_nf repository with git clone.
If you have done this before, navigate into your brentlab_rnaseq_nf directory and run git pull to pick up any updates.
If you get an error that says something like “this is not a git directory” when you know it is, in fact, a git directory, then HTCF deleted some files. In that case, navigate out of brentlab_rnaseq_nf, delete it (rm -rf brentlab_rnaseq_nf), and clone the repository again with git clone.
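The three cases above (first clone, update, and recovery after files were deleted) can be sketched as one function. This is only an illustration: sync_pipeline_repo is a hypothetical name, and the repository URL is left as an argument because it is not given in this section.

```shell
# Hypothetical helper: clone the pipeline repository, or update or
# re-clone an existing copy.
# $1 = repository URL (an assumption: use the URL the lab gives you)
# $2 = destination directory, eg brentlab_rnaseq_nf in your scratch space
sync_pipeline_repo() {
    local repo_url="$1"
    local dest="$2"
    if [ ! -d "${dest}" ]; then
        # first time: there is no local copy yet
        git clone "${repo_url}" "${dest}"
    elif [ -d "${dest}/.git" ] && git -C "${dest}" rev-parse --git-dir > /dev/null 2>&1; then
        # intact clone: pull any updates
        git -C "${dest}" pull
    else
        # directory exists but git no longer recognizes it:
        # delete it and clone again
        rm -rf "${dest}"
        git clone "${repo_url}" "${dest}"
    fi
}
```

Note that the last branch deletes the directory, so do not keep your own files inside brentlab_rnaseq_nf.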
Copy the fastq files into scratch
I suggest having an rnaseq_pipeline directory in your personal scratch space. If you don’t have one, make one, or otherwise navigate to wherever you are keeping rnaseq type data. You can use the script here for the job. Ask if you need help setting this up to use on HTCF. The script is assumed to be in your $PWD when you run it.
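For readers who only want to see the shape of this step, here is a minimal stand-in. It is not the lab script mentioned above. The function name copy_fastqs_to_scratch is hypothetical, and the sketch assumes the run's files end in .fastq.gz and sit directly in one source directory.

```shell
# Hypothetical helper, NOT the lab script referenced above:
# copy every gzipped fastq file from one directory into scratch.
# $1 = directory holding the run's fastq files
# $2 = destination directory in your scratch space
copy_fastqs_to_scratch() {
    local source_dir="$1"
    local dest_dir="$2"
    mkdir -p "${dest_dir}"
    # -p preserves timestamps and permissions; the glob assumes
    # a .fastq.gz extension, and cp fails if nothing matches
    cp -p "${source_dir}"/*.fastq.gz "${dest_dir}/"
}
```

Whatever you use, the destination should agree with the directory passed to createNovoalignPipelineSamplesheet when the sample sheet was written.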
Run the pipeline
Navigate into the directory in which you are going to store the input/output of the pipeline, eg, the run_1234 directory made earlier.
Make the params file
You will need a file describing the experiment. This should go into the directory where the input/output is stored. It must look like this, and the paths must be correct. Save this as, eg, params_run1234.json. The example below is also shown here
{
"output_dir": ".",
"sample_sheet": "path/to/sample_sheet.csv",
"run_number": "1234",
"KN99_novoalign_index": "/scratch/mblab/chasem/rnaseq_pipeline/genome_files/KN99/KN99_genome_fungidb.nix",
"KN99_fasta": "/scratch/mblab/chasem/rnaseq_pipeline/genome_files/KN99/KN99_genome_fungidb.fasta",
"KN99_stranded_annotation_file": "/scratch/mblab/chasem/rnaseq_pipeline/genome_files/KN99/KN99_stranded_annotations_fungidb_augment.gff",
"KN99_unstranded_annotation_file": "/scratch/mblab/chasem/rnaseq_pipeline/genome_files/KN99/KN99_no_strand_annotations_fungidb_augment.gff",
"htseq_count_feature": "exon"
}
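Since the paths in this file must be correct, it can help to check them before launching anything. The sketch below is one way to do that under simple assumptions: check_params_paths is a hypothetical name, only values that start with "/" are checked (relative values such as "." are skipped), and paths are assumed to contain no spaces.

```shell
# Hypothetical helper: report any absolute path in a params file
# that does not exist on this machine.
# $1 = params JSON file; returns non-zero if a path is missing
check_params_paths() {
    local params_file="$1"
    local status=0
    local path
    # pull out every quoted value that starts with "/" and test it
    for path in $(grep -o '"/[^"]*"' "${params_file}" | tr -d '"'); do
        if [ ! -e "${path}" ]; then
            echo "missing: ${path}" >&2
            status=1
        fi
    done
    return "${status}"
}

# Example call, run on HTCF from the run directory:
# check_params_paths params_run1234.json
```

This uses plain text matching rather than a JSON parser, so it will not catch a malformed file; it only catches paths that do not exist.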
Run nextflow
NOTE: both in the params file, and in the run script below, you must make sure that the paths are correct. They won’t be, unless you change them to make them correct for you.
Next, make a script to run the pipeline. [An example may be found here](https://github.com/BrentLab/brentlabRnaSeqTools/blob/main/inst/bash/run_novo_nf_pipeline.sh), or you can copy/paste what is below into a file. Remember to update the paths.
#!/bin/bash
#SBATCH --time=15:00:00 # right now, 15 hours. change depending on time expectation to run
#SBATCH --mem-per-cpu=10G
#SBATCH -J your_jobname
#SBATCH -o your_jobname.out
ml miniconda
# until HTCF updates and spack is available, this works. When HTCF updates and
# we have spack, I'll update this...though at that point, hopefully we are no
# longer using this pipeline
source activate /scratch/mblab/chasem/rnaseq_pipeline/conda_envs/nextflow
mkdir -p tmp # -p: do not fail if tmp already exists
nextflow run /path/to/brentlab_rnaseq_nf/main.nf \
-params-file /path/to/your_params.json
You can check progress by looking at the squeue output and the <your_jobname>.out file. Right now, it is taking a very long time for HTCF to launch nextflow. When HTCF updates to the ‘new’ implementation, it starts much faster.
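A small sketch of that check, for convenience. The function name check_pipeline_progress and the script name run_pipeline.sh are hypothetical. The sketch assumes the job was submitted with sbatch, which the #SBATCH lines in the run script imply.

```shell
# Hypothetical helper: show your jobs in the queue and the end of the job log.
# $1 = log file written by the job (the -o value in the run script)
check_pipeline_progress() {
    local log_file="$1"
    # squeue only exists on the cluster, so skip it elsewhere
    if command -v squeue > /dev/null 2>&1; then
        squeue -u "$(id -un)"
    fi
    if [ -f "${log_file}" ]; then
        tail -n 20 "${log_file}"
    else
        echo "no log yet: ${log_file}"
    fi
}

# Typical use on HTCF, assuming the run script was saved as run_pipeline.sh:
# sbatch run_pipeline.sh
# check_pipeline_progress your_jobname.out
```

If the job is listed by squeue but the log is still empty, that is consistent with the slow nextflow launch described above.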