Process a New Run From MGI

Abstract

After receiving a new run from MGI, align, count and QC brentlabRnaSeqTools package version: 0.0.0.0

library(brentlabRnaSeqTools)
library(tidyverse)

Get the metadata from the database

meta = getMetadata(
  database_info$kn99$db_host,
  database_info$kn99$db_name,
  Sys.getenv("db_username"),
  Sys.getenv("db_password")
)

Filter out the reads of interest

run_df = meta %>%
  filter(runNumber == 5500)

Look at it. Make sure it is correct

View(run_df)

Write out

If you have mounted your local to HTCF, you can write directly to HTCF. Otherwise, write to your computer and follow the directions below to move it to HTCF.

sample_sheet = createNovoalignPipelineSamplesheet(run_df, "/scratch/mblab/chasem/rnaseq_pipeline/scratch_sequence")

write_csv(sample_sheet, "/path/to/where/you/write_things/run_<some_identifier>.csv")

Move a file to HTCF

Log into HTCF and make a directory that will store the input/output for this run. For example, if I were processing run_1234, I would log into HTCF and make a directory like so:

$ mkdir /scratch/mblab/chasem/rnaseq_pipeline/align_count_results/run_1234

Back on your local computer, send the file from your local to HTCF with scp

# copy the file from your computer to a directory in your personal subdirectory 
# of the lab scratch space
$ scp /path/to/where/you/write_things/run_<some_identifier>.csv \ 
      <your_username>@htcf.wustl.edu:/scratch/mblab/<your_username>/rnaseq_pipeline/align_count_results/run_1234

Please note that there is no requirement that the path look like this: <your_username>/rnaseq_pipeline/align_count_results/run_1234. It is just an example of what it might look like.

On HTCF, start the pipeline

The first time you do this, navigate to your scratch space and do this:

$ git clone https://github.com/cmatKhan/brentlab_rnaseq_nf.git

If you have done this before, navigate into your brentlab_rnaseq_nf directory and do this to pull any possible updates:

$ git pull https://github.com/cmatKhan/brentlab_rnaseq_nf.git

If you get some sort of error that says something like, “this is not a git directory”, when you know it is, in fact, a git directory, then HTCF deleted some files. In that case, navigate out of brentlab_rnaseq_nf, delete it (rm -rf brentlab_rnaseq_nf), and use the git clone command described above.

Copy the fastq files into scratch

I suggest having a rnaseq_pipeline directory in your personal scratch space. If you don’t have one, make one, or otherwise navigate to where ever you are keeping rnaseq type data. You can use the script here for the job. Ask if you need help setting this up to use on HTCF. Here is an example, assuming that you have this scriptin your $PWD

$ ./fastqFilesToScratchFromSamplesheet.sh path/to/sample_sheet.csv /lts/mblab/sequence_data/rnaseq_data/lts_sequence

Run the pipeline

Navigate into the directory into which you are going to store the input/output of the pipeline, eg:

$ cd rnaseq_pipeline/align_count_results/run_1234

Make the params file

You will need a file describing the experiment. This should go into the directory where the input/output is stored. It must look like this, and the paths must be correct. Save this as, eg, params_run1234.json. The example below is also shown here

{
"output_dir": ".",
"sample_sheet": "path/to/sample_sheet.csv",
"run_number": "1234",
"KN99_novoalign_index": "/scratch/mblab/chasem/rnaseq_pipeline/genome_files/KN99/KN99_genome_fungidb.nix",
"KN99_fasta": "/scratch/mblab/chasem/rnaseq_pipeline/genome_files/KN99/KN99_genome_fungidb.fasta",
"KN99_stranded_annotation_file": "/scratch/mblab/chasem/rnaseq_pipeline/genome_files/KN99/KN99_stranded_annotations_fungidb_augment.gff",
"KN99_unstranded_annotation_file": "/scratch/mblab/chasem/rnaseq_pipeline/genome_files/KN99/KN99_no_strand_annotations_fungidb_augment.gff",
"htseq_count_feature": "exon"
}

Run nextflow

NOTE: both in the params file, and in the run script below, you must make sure that the paths are correct. They won’t be, unless you change them to make them correct for you.

Next, make a script to run the pipeline. [An example may be found here]((https://github.com/BrentLab/brentlabRnaSeqTools/blob/main/inst/bash/run_novo_nf_pipeline.sh), or you can copy/paste what is below into a file. Remember to update the paths.

#!/bin/bash

#SBATCH --time=15:00:00  # right now, 15 hours. change depending on time expectation to run
#SBATCH --mem-per-cpu=10G
#SBATCH -J your_jobname.out
#SBATCH -o your_jobname.out

ml miniconda

# until HTCF updates and spack is available, this works. When HTCF updates and 
# we have spack, ill update this...though at that point, hopefully we are no 
# longer using this pipeline
source activate /scratch/mblab/chasem/rnaseq_pipeline/conda_envs/nextflow

mkdir tmp

nextflow run /path/to/brentlab_rnaseq_nf/main.nf \ 
             -params-file /path/to/your_params.json

You can check progress by looking at the squeue and the <your_jobname>.out. Right now, it is taking a very long time for HTCF to launch nextflow. When HTCF updates to the ‘new’ implementation, it starts much faster.

Chase Mateusiak

08/04/2022