Perform the whole GOeval pipeline on one network

Given one network, evaluate will first make subsets with subset_network, then run Over-Representation Analysis (ORA) with webgestalt_network, followed by get_metrics and plot_metrics to plot selected summary statistics.

Usage

evaluate(
  network,
  reference_set,
  output_directory,
  network_name,
  organism = "hsapiens",
  database = "geneontology_Biological_Process_noRedundant",
  gene_id = "ensembl_gene_id",
  edges = c(512, 1024, 2048, 4096, 8192, 16384, 32768, 65536),
  num_possible_TFs = 0,
  permutations = 3,
  penalty = 3,
  fdr_threshold = 0.05,
  get_sum = TRUE,
  get_percent = FALSE,
  get_mean = FALSE,
  get_median = FALSE,
  get_annotation_overlap = FALSE,
  get_size = TRUE,
  plot = TRUE
)

Arguments

network: path to the file containing the network to create subsets from. The file should be tab-separated with three columns: source node, target node, edge score
reference_set: path to the set of all genes possibly included in the network. Must be a .txt file containing exactly one column of the genes that could possibly appear in the network.
output_directory: path to the directory in which to store the generated network subsets, ORA summaries, and plots
network_name: short name for the network - used in file naming so may not contain spaces
organism: a string specifying the organism that the data is from, e.g. "hsapiens" or "scerevisiae" - see options with WebGestaltR::listOrganism()
database: the gene set database to search for enrichment - see options with WebGestaltR::listGeneSet(). Must be a Gene Ontology "biological process" database if get_annotation_overlap = TRUE.
gene_id: the naming system used for the input genes - see options with WebGestaltR::listIdType() and see webgestalt.org for examples of each type
edges: list of total numbers of edges or average edges per TF to include in each subset
num_possible_TFs: if set to a number > 0, the elements of 'edges' will first be multiplied by this number to get the number of edges for each subset
permutations: the number of randomly permuted networks to create and run ORA on
penalty: the penalty applied to the 'sum' metric for each TF in the network
fdr_threshold: the FDR threshold for a gene set term to be considered significantly over-represented for the purposes of calculating the 'percent' metric
get_sum: boolean whether to get and plot the 'sum' metric, which is the sum of the negative log base 10 of the p-value for the top term of each source node minus 'penalty' times the total number of source nodes.
get_percent: boolean whether to get and plot the 'percent' metric, which is the percent of source nodes with at least one term with a FDR below the 'fdr_threshold'
get_mean: boolean whether to get and plot the 'mean' metric, which is the mean negative log base 10 of the p-value for the top term of each source node regardless of significance
get_median: boolean whether to get and plot the 'median' metric, which is the median negative log base 10 of the p-value for the top term of each source node regardless of significance
get_annotation_overlap: boolean whether to get and plot the 'annotation_overlap' metric, which is the percent of source nodes that are annotated to at least one of the 16 GO terms for which their target genes are most enriched
get_size: boolean whether to get and plot the 'size' metric, which is the number of source nodes in the network subset that have more than one target gene with annotations. This number is used in the calculation of all other metrics.
plot: boolean whether to make plots of the calculated metrics and write them to a pdf

Value

output of get_metrics. Can be used as input to plot_metrics.

Details

The input file should be tab-separated with two or three columns: source node (e.g. transcription factor), target node (e.g. the regulated gene), and, optionally, edge score.