Development build for pari-neic/infectious-diseases-toolkit@f446789 (branch: main)

Human biomolecular data: Data analysis

Introduction

Data analysis of human biomolecular data of infectious diseases involves exploring the collected data to understand what a dataset shows and identifying relationships between variables using mathematical formulas or models. It is crucial to follow best practices throughout, in particular the FAIR principles (Findability, Accessibility, Interoperability, and Reusability), so that information can be collected and shared as effectively as possible.

General considerations

Some considerations for analysing human biomolecular data of infectious diseases are:

  • Select the tools best suited for the analysis of your data.
  • Document the exact steps used for data analysis.
  • Choose between several computing infrastructure types, e.g. cluster, cloud.
  • Take into account the computing resources needed.
  • Identify which type of data you are using, e.g. DNAseq, ATACseq, CNV.
  • Plan the integration of different types of data (e.g. RNAseq and DNAseq).
  • Ensure that you follow the FAIR principles.
  • Guarantee access to the data and tools for all collaborators, for reproducibility, by:
    • Providing your code
    • Providing your execution environment
    • Providing your workflows
    • Providing your data analysis execution

When looking for solutions to some of the considerations above, you may want to consult the documentation available on the RDMkit website.
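
One of the considerations above, documenting the exact steps used for data analysis, can be illustrated with a small provenance record. The following is a minimal sketch using only the Python standard library; the function name `build_provenance_record`, the step name and the parameter shown are hypothetical and not part of any toolkit mentioned on this page.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def build_provenance_record(step_name, parameters, packages=()):
    """Collect the details needed to repeat one analysis step."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "step": step_name,
        "parameters": dict(parameters),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "command_line": list(sys.argv),
        "package_versions": versions,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    record = build_provenance_record(
        "read_alignment",             # hypothetical step name
        {"min_mapping_quality": 20},  # hypothetical parameter
        packages=["pip"],
    )
    print(json.dumps(record, indent=2))
```

Writing such a record next to each output file is one simple way to keep steps, parameters and software versions together; workflow runners typically capture this kind of information in a more complete way.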

Existing approaches

Below are some general existing approaches that can help with and improve your data analysis pipeline or protocol:

  • Container environments: As an alternative to package management systems you can consider container environments like Docker or Singularity.
  • Web-based code hosting platforms: Provide a centralised location for software developers to store, manage, collaborate on, and share their code. You can use GitHub (widely used), GitLab or Bitbucket.
  • Workflow platforms: Allow you to manage your data and provide an interface (web, GUI, APIs) to run complex pipelines and review their results. For instance: Galaxy and Arvados (CWL-based, open source).
  • Workflow runners: Allow you to take a workflow written in a proprietary or standardised format (such as the CWL standard) and execute it locally or on a remote computing infrastructure. For instance: toil-cwl-runner, the reference CWL runner (cwltool), Nextflow, Snakemake, Cromwell.
  • Integration pipelines: These pipelines are used to integrate different types of data, such as genomic, proteomic, and metabolomic data. They involve steps such as data preprocessing, data integration, and functional analysis. Tools like Omicsgenerator can be used for data integration.
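
The data integration step mentioned in the last item can be pictured, at its simplest, as joining several data layers on the samples they share. The sketch below is only an illustration in plain Python; the function `integrate_by_sample`, the layer names and the values are hypothetical, and dedicated integration tools do far more than this join.

```python
def integrate_by_sample(layers):
    """Join {layer_name: {sample_id: {feature: value}}} on shared sample IDs.

    Feature names are prefixed with the layer name so they cannot collide.
    """
    tables = list(layers.values())
    if not tables:
        return {}
    # Samples present in every layer.
    shared = set(tables[0]).intersection(*tables[1:])
    merged = {}
    for sample in sorted(shared):
        row = {}
        for layer_name, table in layers.items():
            for feature, value in table[sample].items():
                row[f"{layer_name}:{feature}"] = value
        merged[sample] = row
    return merged


# Hypothetical measurements for two data types.
rna = {"S1": {"IFNG": 8.1}, "S2": {"IFNG": 2.3}}
protein = {"S1": {"IFNG": 0.9}, "S3": {"IFNG": 0.1}}

integrated = integrate_by_sample({"rna": rna, "protein": protein})
print(integrated)
```

Only samples measured in every layer are kept here; whether to drop or impute incomplete samples is a design choice that depends on the study.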

Preprocessing

Data preprocessing is the phase of the project in which data is cleaned, transformed and converted into the desired format in preparation for analysis. The goal of preprocessing is to ensure that the data is of high quality and is suitable for the intended analysis. Preprocessing can involve a range of steps, depending on the type of data and the analysis being performed.

Preprocessing is a critical step in the analysis of human biomolecular data of infectious diseases because it can greatly affect the accuracy and reliability of the results. By ensuring that the data is of high quality and suitable for analysis, preprocessing helps researchers obtain more accurate and meaningful insights from this data.

Considerations

Here are some common considerations involved in data preprocessing:

  • Data cleaning: This step involves identifying and correcting errors or inconsistencies in the data. Examples of data cleaning include removing duplicates, correcting typos or misspellings, and identifying and handling missing data.
  • Data transformation: This step involves transforming the data to make it suitable for analysis. Examples of data transformation include converting data types (e.g., from categorical to numerical), scaling data, and normalising data.
  • Remove low-quality samples: Samples that have low sequencing depth, a high number of missing values, or poor alignment quality can be removed from the dataset to ensure that the remaining samples are of high quality.
  • Identify and remove outliers: Outliers are data points that fall outside the expected range of values and can skew the analysis results. Outliers can be identified using statistical methods and removed from the dataset.
  • Check for batch effects: Batch effects are systematic differences in the data that arise from technical or experimental factors. Batch effects can be identified using statistical methods and removed from the dataset.
  • Data normalisation: Normalisation is a common preprocessing step that aims to remove systematic biases in the data. Normalisation methods should be evaluated to ensure that they are effective and do not introduce additional biases.
  • Perform quality control checks at each preprocessing step: Quality control checks should be performed at each preprocessing step to ensure that the data is of high quality and suitable for the intended analysis.
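
Two of the considerations above, removing low-quality samples and identifying outliers, can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions: the threshold `min_depth`, the interquartile-range rule with `k=1.5`, and all sample values are hypothetical choices, not recommendations from this page.

```python
from statistics import quantiles


def filter_low_depth(counts, min_depth):
    """Keep only samples whose total read count is at least min_depth."""
    return {
        sample: genes
        for sample, genes in counts.items()
        if sum(genes.values()) >= min_depth
    }


def iqr_outliers(metric, k=1.5):
    """Return sample IDs whose metric lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = quantiles(metric.values(), n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return sorted(s for s, value in metric.items() if value < low or value > high)


# Hypothetical read counts per sample and gene.
counts = {
    "S1": {"geneA": 500, "geneB": 700},
    "S2": {"geneA": 3, "geneB": 2},
}
kept = filter_low_depth(counts, min_depth=1000)
print(sorted(kept))

# Hypothetical per-sample QC metric (for example, percentage of missing values).
missing_pct = {"S1": 10, "S2": 11, "S3": 12, "S4": 12,
               "S5": 13, "S6": 13, "S7": 14, "S8": 100}
print(iqr_outliers(missing_pct))
```

Flagged samples are returned rather than deleted, so that the decision to remove them can be reviewed and documented.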

Existing approaches

Preprocessing can be done using state-of-the-art bioinformatics tools and/or programming languages that offer functions and packages for processing this kind of data. For example, Python, R (for instance through RStudio) or command-line tools each allow you to carry out all the steps needed for the desired preprocessing pipeline or protocol.
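
As one illustration of the Python route, the normalisation and transformation considerations listed earlier could be sketched as a counts-per-million (CPM) scaling followed by a log2 transform. This is a deliberately simple library-size normalisation, not the method implemented by any specific package; the function names and the pseudocount of 1 are assumptions made for the example.

```python
import math


def counts_per_million(sample_counts):
    """Scale raw counts of one sample so that they sum to one million."""
    total = sum(sample_counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {gene: count * 1_000_000 / total for gene, count in sample_counts.items()}


def log2_cpm(sample_counts, pseudocount=1.0):
    """log2-transform CPM values; the pseudocount avoids log2(0)."""
    cpm = counts_per_million(sample_counts)
    return {gene: math.log2(value + pseudocount) for gene, value in cpm.items()}


# Hypothetical raw counts for one sample.
raw = {"geneA": 250, "geneB": 750, "geneC": 0}
print(counts_per_million(raw))
print(log2_cpm(raw))
```

Because every sample is scaled to the same total, samples sequenced at different depths become comparable, which is the systematic bias this kind of normalisation is meant to remove.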

For quality control protocols, see the Human biomolecular data - Quality control page.

Analysis

The analysis of human biomolecular data involves the use of various techniques and approaches to extract meaningful information from biological samples such as DNA, RNA, proteins, and metabolites.

This stage relies on the previous stages (collection, processing), which lay the foundations for generating new knowledge by providing accurate and trustworthy data.

Considerations

  • The location of your data: Proximity to computing resources is crucial due to its impact on data transfer across infrastructures. It is worthwhile to compare the cost of transferring large data volumes versus the transfer of virtual machine images for analysis purposes.
  • Analysis of the data: Before analysing the data, evaluate the computing environment and decide among the various types of computing infrastructure, such as clusters or clouds. It is also crucial to select a suitable work environment, such as the command line or a web portal, based on your requirements and expertise.
  • Best tools: You need to select the tools best suited for the analysis of your data.
  • Document the steps: Accurate documentation of the data analysis process is essential. It should cover the precise steps taken, the software versions, the parameters used, and the computing environment. Note that “manual” manipulation of the data can complicate this documentation.
  • Collaborative analysis: When engaging in collaborative data analysis, it is crucial to ensure that all collaborators have access to the data and tools required. This can be facilitated by establishing virtual research environments that provide a shared platform for seamless collaboration.
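
The comparison raised under "The location of your data" (moving a large dataset versus moving a virtual machine image) can be estimated with simple arithmetic. The sketch below uses transfer time as a stand-in for cost and assumes an ideal link with no protocol overhead; the function name, sizes and bandwidth are hypothetical.

```python
def transfer_hours(size_gb, bandwidth_mbit_s):
    """Estimated hours to move size_gb gigabytes over a bandwidth_mbit_s link."""
    if bandwidth_mbit_s <= 0:
        raise ValueError("bandwidth must be positive")
    size_megabits = size_gb * 8 * 1000  # 1 GB = 8000 megabits (decimal units)
    seconds = size_megabits / bandwidth_mbit_s
    return seconds / 3600


# Hypothetical sizes: a 5 TB sequencing dataset and a 20 GB machine image.
dataset_hours = transfer_hours(5000, bandwidth_mbit_s=1000)
image_hours = transfer_hours(20, bandwidth_mbit_s=1000)

if image_hours < dataset_hours:
    print("Moving the machine image to the data is faster.")
else:
    print("Moving the data to the compute resource is faster.")
```

A real comparison would also include storage and egress fees and any legal restrictions on moving human data, which this estimate does not model.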

Existing approaches

There are several types of analysis that can be performed on human biomolecular data, depending on the specific research question and type of data being analysed. Here are some common types of analysis:

  • Gene expression analysis: This involves measuring the expression levels of genes in a biological sample and comparing them across different conditions or groups of samples. This can be done using techniques such as microarray analysis or RNA sequencing.
  • Genomic analysis: This involves the interpretation of genetic information encoded in DNA sequences. DNA data analysis can be used for a wide range of applications, such as identifying genetic variants associated with disease, studying the evolution of species, and understanding the molecular mechanisms underlying biological processes.
  • Epigenetic analysis: This involves measuring changes in DNA methylation, histone modifications, or other epigenetic marks in different samples or conditions. This can help to understand how gene expression is regulated and identify potential biomarkers or therapeutic targets.
  • Protein-protein interaction analysis: This involves identifying proteins that interact with each other and exploring the functional consequences of these interactions. This can help to identify new targets for drug development and understand disease mechanisms.
  • Metabolomics analysis: This involves measuring the levels of small molecules (metabolites) in biological samples and comparing them across different conditions or groups of samples. This can help to identify biomarkers of disease or drug response.
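
Comparing expression levels across conditions, as described for gene expression analysis, is often summarised as a log2 fold change. The following is a minimal sketch of that calculation only; it performs no statistical testing and is not a substitute for dedicated differential expression packages. The function name, the pseudocount and the values are assumptions made for the example.

```python
import math
from statistics import mean


def log2_fold_change(treated, control, pseudocount=1.0):
    """log2 ratio of mean expression in treated versus control samples."""
    return math.log2((mean(treated) + pseudocount) / (mean(control) + pseudocount))


# Hypothetical normalised expression values per gene (treated, control).
expression = {
    "geneA": ([7, 7, 7], [1, 1, 1]),
    "geneB": ([5, 5, 5], [5, 5, 5]),
}

fold_changes = {
    gene: log2_fold_change(treated, control)
    for gene, (treated, control) in expression.items()
}

# Rank genes by the size of the change, largest first.
for gene in sorted(fold_changes, key=lambda g: abs(fold_changes[g]), reverse=True):
    print(gene, round(fold_changes[gene], 2))
```

The pseudocount keeps the ratio defined when a gene has zero expression in one group, at the price of shrinking fold changes for lowly expressed genes.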

Postprocessing

Postprocessing refers to the steps taken after the initial analysis to refine and interpret the results. These steps are important because they can help to identify biological patterns and relationships that were not apparent in the initial analysis, and to ensure that the results are biologically meaningful and reproducible.

Considerations

Some considerations to take into account when performing postprocessing on human biomolecular data include:

  • Interpretation: Once the results have been generated, it is important to interpret them in a biologically meaningful context. This can include identifying enriched pathways or gene sets, performing network analysis, or annotating the results.
  • Visualisation: It is important to visualise the results in a clear and informative way. This can help to identify patterns and relationships in the data, and to communicate the results to others in a clear and accessible way.

Existing approaches

  • Functional enrichment analysis: These analyses are used to identify enriched pathways and biological functions associated with differentially expressed biomolecules. They involve steps such as gene ontology analysis, pathway analysis, and network analysis. Tools like GSEA, GO, KEGG, DAVID and Cytoscape can be used for functional enrichment analysis and for annotating the results.

  • Visualisation:

All these workflows and tools can be adapted and customised based on the specific type of data being analysed and the research question being addressed.
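
The idea behind functional enrichment can be illustrated with an over-representation test based on the hypergeometric distribution, which asks how surprising the overlap between a gene list and a pathway is. This is a generic sketch of that test, not the procedure used by any particular tool named above; the function name and all numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(population, in_set, selected, overlap):
    """P(X >= overlap) for X ~ Hypergeometric(population, in_set, selected).

    population: number of genes in the background
    in_set:     number of background genes annotated to the pathway
    selected:   number of genes in the list of interest
    overlap:    number of selected genes that are in the pathway
    """
    total = comb(population, selected)
    upper = min(in_set, selected)
    tail = sum(
        comb(in_set, k) * comb(population - in_set, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 20000 background genes, a pathway of 100 genes,
# 200 differentially expressed genes, 10 of which fall in the pathway.
p = enrichment_p_value(population=20000, in_set=100, selected=200, overlap=10)
print(f"{p:.3g}")
```

When many pathways are tested, the resulting p-values need correction for multiple testing, which this sketch leaves out.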

More information

Tools and resources on this page

| Tool or resource | Description | Related pages | Registry |
|---|---|---|---|
| ANNOVAR | ANNOVAR is an efficient software tool that uses up-to-date information to functionally annotate genetic variants detected from diverse genomes. | | Tool info |
| Arvados | With Arvados, bioinformaticians run and scale compute-intensive workflows, developers create biomedical applications, and IT administrators manage large compute and storage resources. | | |
| BioGRID | BioGRID is a comprehensive biomedical repository for curated protein, genetic and chemical interactions. | | Tool info, Standards/Databases |
| Bismark | Bismark is a program to map bisulfite treated sequencing reads to a genome of interest and perform methylation calls in a single step. | | Tool info, Training |
| Bitbucket | Git based code hosting and collaboration tool, built for teams. | | Standards/Databases |
| Bowtie2 | Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. | | Tool info, Training |
| BWA | BWA is a software package for mapping DNA sequences against a large reference genome, such as the human genome. | | Tool info, Training |
| Canu | Canu is a fork of the Celera Assembler, designed for high-noise single-molecule sequencing. | | Tool info |
| ClustalW | ClustalW is a progressive multiple sequence alignment tool to align a set of sequences by repeatedly aligning pairs of sequences and previously generated alignments. | | Tool info, Training |
| Cromwell | Cromwell is a Workflow Management System geared towards scientific workflows. | | |
| cwltool | Reference implementation to provide comprehensive validation of CWL files as well as provide other tools related to working with CWL. | | |
| Cytoscape | Cytoscape provides a solid platform for network visualization and analysis. | | Tool info, Training |
| DAVID | The Database for Annotation, Visualization and Integrated Discovery (DAVID) provides a comprehensive set of functional annotation tools for investigators to understand the biological meaning behind large lists of genes. | | Tool info, Training |
| dbNSFP | A comprehensive database of transcript-specific functional predictions and annotations for human non-synonymous and splice-site SNVs. | | Tool info |
| DeepVariant | DeepVariant is a deep learning-based variant caller that takes aligned reads (in BAM or CRAM format), produces pileup image tensors from them, classifies each tensor using a convolutional neural network, and finally reports the results in a standard VCF or gVCF file. | | Tool info |
| Delly | Delly is an integrated structural variant (SV) prediction method that can discover, genotype and visualize deletions, tandem duplications, inversions and translocations at single-nucleotide resolution in short-read and long-read massively parallel sequencing data. | | Tool info |
| DESeq2 | Differential gene expression analysis based on the negative binomial distribution. | | Tool info, Training |
| Docker | Docker is a software for the execution of applications in virtualized environments called containers. It is linked to DockerHub, a library for sharing container images. | | Standards/Databases, Training |
| Dragen-GATK | DRAGEN-GATK Best Practices contains open-source workflows that are compatible between Illumina's platforms and mainstream infrastructure. | | |
| EdgeR | Empirical Analysis of Digital Gene Expression Data in R. | | Tool info, Training |
| Flye | Flye is a de novo assembler for single-molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies. | | Tool info, Training |
| FreeBayes | FreeBayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs, indels, MNPs, and complex events smaller than the length of a short-read sequencing alignment. | | Tool info, Training |
| Galaxy | Open, web-based platform for data intensive biomedical research. Whether on the free public server or your own instance, you can perform, reproduce, and share complete analyses. | Provenance | Tool info, Training |
| GeneMANIA | GeneMANIA helps you predict the function of your favourite genes and gene sets. | | Tool info, Training |
| ggplot2 | ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. | | Tool info, Training |
| GitHub | GitHub is a versioning system, used for sharing code, as well as for sharing of small data. | | Standards/Databases, Training |
| GitLab | GitLab is an open source end-to-end software development platform with built-in version control, issue tracking, code review, CI/CD, and more. Self-host GitLab on your own servers, in a container, or on a cloud provider. | | Standards/Databases, Training |
| GO | GO (Gene Ontology) is used to perform enrichment analysis on gene sets. | | Tool info, Training |
| GRIDSS | GRIDSS is a modular software suite containing tools useful for the detection of genomic rearrangements. | | Tool info |
| GSEA | Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states. | | Tool info, Training |
| HISAT2 | HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (whole-genome, transcriptome, and exome sequencing data) to a population of human genomes (as well as to a single reference genome). | | Tool info, Training |
| IGV | The Integrative Genomics Viewer (IGV) is a high-performance, easy-to-use, interactive tool for the visual exploration of genomic data. | | Tool info, Training |
| IntAct | IntAct (Molecular Interaction Database) Website | | Tool info, Standards/Databases, Training |
| KEGG | A set of annotation maps for Kyoto encyclopedia of genes and genomes (KEGG). | | Tool info, Training |
| Lumpy | A probabilistic framework for structural variant discovery. | | Tool info |
| MACS | Model-based Analysis of ChIP-Seq (MACS), for identifying transcription factor binding sites. | | Tool info, Training |
| MAFFT | MAFFT is a multiple sequence alignment program. | | Tool info |
| Manta | Manta calls structural variants (SVs) and indels from mapped paired-end sequencing reads. | | Tool info |
| matplotlib | Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. | | Tool info |
| MetaboAnalyst | MetaboAnalyst is a comprehensive platform dedicated to metabolomics data analysis via a user-friendly, web-based interface. | | Tool info, Training |
| MethylKit | methylKit is an R package for DNA methylation analysis and annotation from high-throughput bisulfite sequencing. | | Tool info |
| methylPipe | Base resolution DNA methylation data analysis. | | Tool info |
| MetSign | A computational platform for high-resolution mass spectrometry-based metabolomics. | | |
| MUSCLE | MUSCLE is widely-used software for making multiple alignments of biological sequences. | | Tool info, Training |
| Mzmine | MZmine 3 is an open-source software for mass-spectrometry data processing, with the main focus on LC-MS data. | | Tool info |
| Nextflow | Nextflow is a framework for data analysis workflow execution. | | Tool info, Training |
| Omicsgenerator | Omics Integrator is a package designed to integrate proteomic data, gene expression data and/or epigenetic data using a protein-protein interaction network. | | |
| OpenMS | OpenMS is an open-source software C++ library for LC-MS data management and analyses. | | Tool info, Training |
| PhyML | PhyML is a software package that uses modern statistical approaches to analyse alignments of nucleotide or amino acid sequences in a phylogenetic framework. | | Tool info |
| SICER2 | Redesigned and improved ChIP-seq broad peak calling tool SICER. | | |
| Singularity | Singularity is a widely-adopted container runtime that implements a unique security model to mitigate privilege escalation risks and provides a platform to capture a complete application environment into a single file (SIF). | | Training |
| Snakemake | Snakemake is a framework for data analysis workflow execution. | Provenance | Tool info, Training |
| SnpEff | Genetic variant annotation and functional effect prediction toolbox. It annotates and predicts the effects of genetic variants on genes and proteins. | | Tool info, Training |
| SPAdes | SPAdes is an assembly toolkit containing various assembly pipelines. | | Tool info, Training |
| STAR | Spliced Transcripts Alignment to a Reference. | | Tool info, Training |
| toil-cwl-runner | The toil-cwl-runner command provides cwl-parsing functionality using cwltool, and leverages the job-scheduling and batch system support of Toil. | | |
| UCSC Genome Browser | An online tool for analyzing and visualizing genomic data. It allows users to add and share annotations. An automated SARS-CoV-... | | Tool info, Standards/Databases |
| VarScan | Variant calling and somatic mutation/CNV detection for next-generation sequencing data. | | Tool info |
| VEP | VEP (Variant Effect Predictor) predicts the functional effects of genomic variants. | | Tool info, Training |
| wtdbg2 | Wtdbg2 is a de novo sequence assembler for long noisy reads produced by PacBio or Oxford Nanopore Technologies (ONT). | | Tool info |
| XCMS | Metabolomic and lipidomic platform. | | Tool info, Training |