Metabolic pathways: microarray analysis tools
Just brief notes and many links for my (and perhaps your) memory and benefit. Based largely on this PLoS Computational Biology review by Khatri et al., 2012.
Pathway-centric approaches are intended to reduce the complexity of the transcriptome profiling data and are well accepted because a significant modulation of a pathway (rather than single genes) is deemed to increase the explanatory power of the observations.
1st generation ORA: Over-Representation Analysis statistically evaluates the fraction of genes in a particular pathway found among the list of genes with significantly different fold changes. This is the most common analysis. Limitations: once the scientist selects the list of genes, ORA treats each gene equally, irrespective of its fold change. Genes that fall just short of the significance threshold are dropped, so information that could be important inside a pathway is lost. In addition, ORA approaches assume that the fold change (FC) of each gene is independent of the FC of the other genes, which is not the case: for instance, a decrease in one transcription factor can cause a decrease of its target genes.
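To make the ORA idea concrete, here is a minimal sketch of the over-representation test, assuming the common choice of a one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test). The function name, argument names and the toy numbers are mine, for illustration only; real tools also correct for testing many pathways at once.

```python
from math import comb


def ora_p_value(n_universe, n_pathway, n_selected, n_overlap):
    """One-sided hypergeometric p-value: probability of seeing at least
    n_overlap pathway genes in a random list of n_selected genes drawn
    from a universe of n_universe genes, n_pathway of which belong to
    the pathway. All names here are hypothetical, for illustration."""
    total = comb(n_universe, n_selected)
    upper = min(n_pathway, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 20 measured genes, 5 in the pathway,
# 4 genes called significant, 3 of them in the pathway.
p = ora_p_value(n_universe=20, n_pathway=5, n_selected=4, n_overlap=3)
print(f"ORA p-value: {p:.4f}")  # roughly 0.032
```

Note that only counts go in: the fold changes themselves never enter the calculation, which is exactly the limitation described above.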
2nd generation FCS: Functional Class Scoring. In response to ORA limitations, FCS methods assess ALL gene expression measurements, so they do not require an arbitrary threshold. This makes them good at detecting effects on genes changing less than 2-fold. The most popular FCS approach is Gene Set Enrichment Analysis (GSEA). GSEA ranks all the gene expression changes and tests whether the genes of a pathway (called a gene set) are significantly over-represented toward one end of the ranking. One possible limitation is that the fold changes are used only to build the ranking; the ranking is then used for the statistical analysis. Others say this is a good thing because it gives robustness to outliers.
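The ranking idea can be sketched with a running-sum statistic. This is a simplified, unweighted (Kolmogorov-Smirnov-like) version written for these notes, not the GSEA software itself: the function name and the toy gene names are made up, and the significance step (usually done by permutation) is left out.

```python
def enrichment_score(ranked_genes, gene_set):
    """Walk down the ranked list, stepping up at gene-set members and
    down at all other genes; return the largest deviation from zero.
    Simplified, unweighted sketch; names are hypothetical."""
    gene_set = set(gene_set)
    n_hits = sum(1 for g in ranked_genes if g in gene_set)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("need both gene-set members and non-members")

    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        if gene in gene_set:
            running += 1.0 / n_hits
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Genes sorted from most up-regulated to most down-regulated (toy data).
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
print(enrichment_score(ranked, {"g1", "g2"}))  # set sits at the top: positive score
print(enrichment_score(ranked, {"g5", "g6"}))  # set sits at the bottom: negative score
```

Only the order of the genes matters here, so a single extreme fold change cannot dominate the score.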
3rd generation PT: Pathway Topology-based approaches. Both ORA and FCS methods consider a pathway just like a gene set: nothing more than a list of genes. However, inside a pathway, the genes have different relationships. The pathway topology describes the shape of these logical relationships and defines the features of the corresponding network between the genes in the pathway. Instead of considering only the most changed genes (ORA) or the whole ranking of gene expression (FCS), the third generation algorithms use topological information to compute the statistics, similarly to what Google PageRank does with internet pages. The major limitation is that most topological relationships are poorly curated in biological knowledge-bases, and some relationships are reportedly tissue-specific, so few PT-based tools are available as complete stand-alone solutions.
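To illustrate the PageRank analogy only, here is a toy sketch that computes a PageRank-style weight for each gene in a small directed pathway graph and then uses those weights to score the pathway. This is not the method of any specific PT tool: the scoring rule (weight times absolute log fold change), the function names, the edge direction convention and the example pathway are all assumptions of mine.

```python
def pagerank(nodes, edges, damping=0.85, iterations=100):
    """Plain power-iteration PageRank on a directed graph.
    edges are (source, target) pairs, e.g. regulator -> target gene."""
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            # A node with no outgoing edge spreads its rank over all nodes.
            targets = out[n] or nodes
            share = damping * rank[n] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank


def topology_weighted_score(log_fc, rank):
    """Hypothetical pathway score: each gene's |log FC| weighted by its rank."""
    return sum(rank[g] * abs(log_fc.get(g, 0.0)) for g in rank)


# Toy pathway: a transcription factor drives A and B, which both act on C.
nodes = ["TF", "A", "B", "C"]
edges = [("TF", "A"), ("TF", "B"), ("A", "C"), ("B", "C")]
rank = pagerank(nodes, edges)

# The same fold change counts more when it hits a topologically central gene.
print(topology_weighted_score({"C": 1.0}, rank))
print(topology_weighted_score({"TF": 1.0}, rank))
```

With edges pointing from regulator to target, the weight accumulates on downstream genes; reversing the edges would instead favour upstream regulators. Which direction makes biological sense is a modelling choice, and it depends on how well the topology is curated in the first place.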
Note: don't be fooled by the title. Besides microarrays, these approaches are valid for other mRNA profiling technologies, including NanoString and RNA-seq. In addition, the same pathway-centric statistics can be applied to 'omics' profiling experiments other than RNA.