Integration of proteomics with genomics and transcriptomics increases the diagnosis rate of Mendelian disorders

This project contains scripts that automate and visualize the analyses performed for the paper "Integration of proteomics with genomics and transcriptomics increases the diagnosis rate of Mendelian disorders".

The webserver is produced as one of the outputs of the pipeline.

Project structure

This project is set up as a wBuild workflow. wBuild is an automatic build tool for R reports based on snakemake.

- The wbuild.yaml is the main configuration file used to set up the workflow
- The Scripts folder contains scripts that will be rendered as HTML reports
- The src folder contains additional helper functions and scripts
- The Output folder will contain all files produced by the analysis pipeline
  - Output/html contains the final HTML report

Data and prerequisites

This project depends on the packages wBuild and PROTRIDER.

The pipeline starts from a set of files available via Zenodo (DOI: 10.5281/zenodo.4501904):

- raw_data
  - proteomics_annotation.tsv - Sample annotation
  - proteomics_not_normalized.tsv - Proteomics intensity matrix
  - raw_counts.tsv - RNA-seq count matrix
  - Patient_HPO_phenotypes.tsv - Phenotype data recorded using HPO terms for diagnosed cases
  - enrichment_proportions_variants.tsv - Results of the rare variant enrichment/proportion analysis calculated on the full dataset
  - patient_variant_hpo_data.tsv - Gene annotation for all individuals. Since the genetic data are not publicly shareable, we provide gene-level information for outlier genes only
- datasets
  - disease_genes.tsv - List of Mendelian disease genes aggregated from several studies
  - HGNC_mito_groups.tsv - Subset of HGNC gene groups related to mitochondria
- Downloaded automatically:
  - gencode.v29lift37.annotation.gtf.gz - Gene-level model based on the GENCODE 29 transcript model
  - Table_S1_gene_info_at_protein_level.xlsx - Supplementary Table 1 from the GTEx proteomics study (Jiang et al., 2020, Cell). Data is available at the GTEx page
  - allComplexes.txt - CORUM protein complexes, available at the CORUM web page
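
Before starting the pipeline it can help to confirm that the manually provided inputs listed above are in place. The following is a minimal sketch, not part of the repository: the function name `missing_inputs` is hypothetical, and it assumes the Zenodo archive has been unpacked so that `raw_data/` and `datasets/` sit directly under the directory you pass in. Adjust the paths if your layout differs.

```shell
# Print every expected input file that is missing below a data directory.
# ASSUMPTION: raw_data/ and datasets/ sit directly under the directory
# given as the first argument (defaults to the current directory).
missing_inputs() {
    data_dir="${1:-.}"
    for f in \
        raw_data/proteomics_annotation.tsv \
        raw_data/proteomics_not_normalized.tsv \
        raw_data/raw_counts.tsv \
        raw_data/Patient_HPO_phenotypes.tsv \
        raw_data/enrichment_proportions_variants.tsv \
        raw_data/patient_variant_hpo_data.tsv \
        datasets/disease_genes.tsv \
        datasets/HGNC_mito_groups.tsv
    do
        # Report the relative path only when the file does not exist.
        [ -f "$data_dir/$f" ] || echo "$f"
    done
}
```

Calling `missing_inputs /path/to/data` prints nothing when all eight files are present. The automatically downloaded files are left out on purpose, since the pipeline fetches them itself.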

The raw proteomic data and MaxQuant search files have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository and can be accessed using the dataset identifier PXD022803.

Repository setup

First download the repo and its dependencies:

# analysis code
git clone https://github.com/gagneurlab/omicsDiagnostics
cd omicsDiagnostics

Then install wBuild using pip and initialize it:

pip install wBuild
wBuild init

Because wBuild init overwrites the existing Snakefile, readme.md, and wbuild.yaml, restore the repository versions of these files with git:

git checkout Snakefile
git checkout wbuild.yaml
git checkout readme.md
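
The restore step can also be wrapped in a small helper that checks the result. This is a sketch under assumptions: the function name `restore_wbuild_files` is hypothetical, and it presumes the three files are tracked in the repository, as the commands above imply.

```shell
# Restore the tracked files that `wBuild init` overwrites, then confirm
# that they match the committed versions again.
restore_wbuild_files() {
    repo_dir="${1:-.}"
    # `--` separates paths from revisions, so a file name can never be
    # mistaken for a branch name.
    git -C "$repo_dir" checkout -- Snakefile wbuild.yaml readme.md || return 1
    # `git diff --quiet` exits non-zero if any listed file still differs.
    git -C "$repo_dir" diff --quiet -- Snakefile wbuild.yaml readme.md
}
```

Run `restore_wbuild_files` from the repository root, or pass the repository path as the first argument. A non-zero exit status means at least one of the files still differs from the committed version.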

Install dependencies

R packages
Make sure that data.table is installed, or install it with install.packages("data.table"), then run:
Rscript src/installRPackages.R src/requirementsR.txt
Create Conda environment
conda env create --name omicsDiagnostics --file=environment.yml
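
The setup steps rely on several command-line tools. As a hedged convenience, the sketch below reports which of them are not on the PATH; the function name `missing_tools` is hypothetical and not part of this repository.

```shell
# Print every command from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        # `command -v` is the POSIX way to test whether a command exists.
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example call (prints nothing if everything is available):
#   missing_tools git pip Rscript conda snakemake
```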

Run the full pipeline

To run the full pipeline using up to 10 cores in parallel (the -j 10 option), execute the following commands:

conda activate omicsDiagnostics
snakemake graph
snakemake publish -j 10
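
If you want to vary the core count or preview the jobs first, the commands above can be wrapped as follows. This is a sketch, not part of the repository: `run_pipeline` is a hypothetical name, it assumes the omicsDiagnostics conda environment is already active, and it adds a dry run via snakemake's `-n` flag before the real run.

```shell
# Run the pipeline with a configurable number of cores (default: 10).
# ASSUMPTION: the omicsDiagnostics conda environment is already active.
run_pipeline() {
    cores="${1:-10}"
    # Reject empty input, zero, and anything containing a non-digit.
    case "$cores" in
        ''|*[!0-9]*|0)
            echo "cores must be a positive integer" >&2
            return 2
            ;;
    esac
    snakemake graph || return 1
    # -n is a dry run: it lists the jobs without executing them.
    snakemake publish -n -j "$cores" || return 1
    snakemake publish -j "$cores"
}
```

For example, `run_pipeline 4` builds the dependency graph, previews the jobs, and then runs the publish target with 4 cores.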