Get summary metrics of a network's ORA results

get_metrics computes the specified summary metrics for every network subset, and its permutations, that has a subdirectory in the provided path, and returns them as a list of data.frames (one per metric). It is designed to be run on the output of the webgestalt_network function, to prepare summary metrics for plotting with the plot_metrics function. The 'directory' path should contain only directories created by webgestalt_network.

Usage

get_metrics(
  directory,
  organism = "hsapiens",
  database = "geneontology_Biological_Process_noRedundant",
  gene_id = "ensembl_gene_id",
  get_sum = TRUE,
  get_percent = FALSE,
  get_mean = FALSE,
  get_median = FALSE,
  get_annotation_overlap = FALSE,
  get_size = TRUE,
  penalty = 3,
  fdr_threshold = 0.05,
  parallel = FALSE
)

Arguments

directory: a directory containing only the directories of ORA summaries created by webgestalt_network for all networks of interest
organism: a string specifying the organism that the data is from, e.g. "hsapiens" or "scerevisiae". Only required if get_annotation_overlap = TRUE.
database: the gene set database to search for enrichment - see options with WebGestaltR::listGeneSet(). Must be a Gene Ontology "biological process" database if get_annotation_overlap = TRUE.
gene_id: the naming system used for the input genes - see options with WebGestaltR::listIdType() and see webgestalt.org for examples of each type. Only required if get_annotation_overlap = TRUE.
get_sum: boolean whether to get the 'sum' metric, which is the sum of the negative log base 10 of the p-value for the top term of each source node minus 'penalty' times the total number of source nodes.
get_percent: boolean whether to get the 'percent' metric, which is the percent of source nodes with at least one term with an FDR below the 'fdr_threshold'
get_mean: boolean whether to get the 'mean' metric, which is the mean negative log base 10 of the p-value for the top term of each source node regardless of significance
get_median: boolean whether to get the 'median' metric, which is the median negative log base 10 of the p-value for the top term of each source node regardless of significance
get_annotation_overlap: boolean whether to get the 'annotation_overlap' metric, which is the percent of source nodes that are annotated to at least one of the 16 GO terms for which their target genes are most enriched
get_size: boolean whether to get the 'size' metric, which is the number of source nodes in the network subset that have more than one target gene with annotations. This number is used in the calculation of all other metrics.
penalty: the penalty subtracted from the 'sum' metric for each source node (e.g. each transcription factor, TF) in the network
fdr_threshold: the FDR threshold for a gene set term to be considered significantly over-represented for the purposes of calculating the 'percent' metric
parallel: boolean whether to get the metrics for each network in the directory in parallel - use with caution, as this has not been adequately tested
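
The metric definitions above can be illustrated with a minimal base R sketch. This is not the package's implementation: the p-values and FDRs are made-up, the variable names are hypothetical, and it assumes that checking the top term's FDR is enough to decide whether a source node has at least one significant term, and that 'percent' is reported on a 0-100 scale.

```r
# Hypothetical top-term results for four source nodes (made-up values).
top_p   <- c(1e-6, 1e-4, 1e-2, 0.5)  # p-value of each node's top term
top_fdr <- c(1e-4, 0.01, 0.2, 1)     # FDR of each node's top term

penalty       <- 3     # same default as get_metrics
fdr_threshold <- 0.05  # same default as get_metrics

neg_log_p <- -log10(top_p)  # negative log base 10 of each top-term p-value
size      <- length(top_p)  # number of source nodes considered

metrics <- list(
  # sum of -log10(p) minus 'penalty' times the number of source nodes
  sum     = sum(neg_log_p) - penalty * size,
  # percent of source nodes whose top term passes the FDR threshold
  # (assumption: 0-100 scale, top term stands in for "at least one term")
  percent = 100 * mean(top_fdr < fdr_threshold),
  # mean and median of -log10(p), regardless of significance
  mean    = mean(neg_log_p),
  median  = median(neg_log_p),
  size    = size
)

str(metrics)
```

With the default penalty of 3, a source node only raises the 'sum' metric if its top term has a p-value below 0.001, which is why the two weakly enriched nodes in this example pull the sum down.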

Value

a list of data.frames, each containing the values of one metric. The columns of a data.frame represent the different subset sizes, and the rows represent the different network permutations. The first row is from the unpermuted networks.
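
To show how a return value of this shape might be handled, the sketch below builds a mock list by hand rather than calling get_metrics. The column names (subset_10, subset_20) and all numbers are hypothetical; only the layout (one data.frame per metric, columns for subset sizes, first row unpermuted) follows the description above.

```r
# Mock return value: one data.frame per metric.
# Row 1 is the unpermuted network; rows 2-3 are permutations.
mock_metrics <- list(
  sum = data.frame(
    subset_10 = c(5.2, 1.1, 0.7),
    subset_20 = c(8.9, 2.3, 1.8)
  ),
  size = data.frame(
    subset_10 = c(4, 4, 3),
    subset_20 = c(9, 8, 9)
  )
)

# Observed (unpermuted) 'sum' for each subset size.
observed <- unlist(mock_metrics$sum[1, ])

# Mean 'sum' across the permuted networks for each subset size.
permuted_mean <- colMeans(mock_metrics$sum[-1, , drop = FALSE])

# How far the real network sits above its permutations.
difference <- observed - permuted_mean
print(difference)
```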