How to do NGS Data analysis on Windows Subsystem for Linux (WSL)?


Next Generation Sequencing

Next-Generation Sequencing (NGS) is a revolutionary technology that allows scientists to sequence DNA and RNA at an unprecedented speed and scale. Unlike traditional Sanger sequencing, which determines the sequence of one DNA fragment at a time, NGS can sequence millions of fragments simultaneously. Bioinformatic tools are used to perform NGS data analysis.

How NGS works:

  1. Library preparation: DNA or RNA is broken into smaller fragments, and adapters are added to both ends of each fragment.
  2. Sequencing: Millions of fragments are sequenced in parallel, generating short DNA sequences called reads.
  3. NGS Data analysis: The generated data is processed and analyzed to reconstruct the original DNA or RNA sequence.

NGS Data: FASTQ Files

The primary output of NGS sequencing is a collection of FASTQ files. Each FASTQ file contains a set of sequences and their corresponding quality scores.

  • Sequence: The base calls for one read (A, C, G, T, with N marking a base that could not be called).
  • Quality score: A Phred-scaled estimate of each base call's accuracy, stored as one ASCII character per base.

FASTQ files are typically compressed using gzip to save storage space.

Example of a FASTQ entry:

@SEQ_ID
GATTTGGGGTTCAAAGC
+
!''*(((()&&&%%%**

  1. The first line (@SEQ_ID) is the sequence identifier.
  2. The second line is the DNA sequence.
  3. The third line is a separator that starts with +, optionally followed by the same identifier.
  4. The fourth line contains one quality character for each base, so it has the same length as the sequence.

These FASTQ files serve as the raw data for subsequent NGS data analysis, such as quality control, read mapping and variant calling.
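
Because every record occupies exactly four lines, a FASTQ file can be sanity-checked with standard command-line tools before any analysis starts. The following is a minimal sketch, assuming a single 17-base example record written to a temporary file; it counts the records and flags any whose sequence and quality strings differ in length.

```shell
# Write one example FASTQ record to a temporary file (illustrative data only).
fq=$(mktemp)
cat > "$fq" <<'EOF'
@SEQ_ID
GATTTGGGGTTCAAAGC
+
!''*(((()&&&%%%**
EOF

# Line 2 of each record is the sequence, line 4 the quality string.
# Count the records and any record where the two lengths disagree.
summary=$(awk 'NR % 4 == 2 { seqlen = length($0) }
               NR % 4 == 0 { n++; if (length($0) != seqlen) bad++ }
               END { printf "records=%d mismatched=%d\n", n, bad + 0 }' "$fq")
echo "$summary"
```

For a gzip-compressed file, the same awk program can read from a pipe, for example `gzip -dc reads.fastq.gz | awk '...'`, where reads.fastq.gz is a placeholder name.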

What is WSL?

Windows Subsystem for Linux (WSL) is a Windows feature that lets you run a Linux environment directly on Windows, without the overhead of a traditional virtual machine or a dual-boot setup. This gives you the best of both worlds: the familiar Windows interface and the powerful command-line tools and utilities of Linux needed for NGS data analysis.

Why Use WSL?

  • Develop cross-platform applications: Seamlessly create and test applications for Windows and Linux environments.  
  • Access a vast ecosystem of Linux tools: Utilize powerful command-line tools like Bash, Git, and Python without leaving Windows.  
  • Improve developer productivity: Streamline workflows and increase efficiency with a familiar Linux environment.
  • Run Linux-based servers and services: Host web servers, databases, and other applications directly on your Windows machine.

System Requirements

Before you start, ensure your Windows system meets the following requirements:

  • Windows 10 (version 1903 or later) or Windows 11
  • Processor with virtualization capabilities enabled in BIOS

Activating WSL

To activate WSL, enable two optional Windows features, “Virtual Machine Platform” and “Windows Subsystem for Linux”:

  1. Open Control Panel: Press Win + R, type control, and press Enter.
  2. Navigate to Programs: Click “Programs” and then “Turn Windows features on or off.”
  3. Enable Virtual Machine Platform: Check the box next to “Virtual Machine Platform.”
  4. Enable Windows Subsystem for Linux: Check the box next to “Windows Subsystem for Linux” and click “OK.”
  5. Restart your computer: The changes only take effect after a restart.

Installing a Linux Distribution

Once WSL is activated, you can install your preferred Linux distribution from the Microsoft Store:

  1. Open Microsoft Store: Search for your desired Linux distribution (e.g., Ubuntu, Debian, Kali Linux).
  2. Install the distribution: Click the “Get” button to install the distribution.  
  3. Create a username and password: Follow the on-screen instructions to create a user account for your Linux distribution.

Using WSL

After installation, you can launch your Linux distribution from the Start menu or by typing wsl in the command prompt. You’ll be presented with a Linux terminal where you can run commands, install software, and use Linux tools.  

Basic WSL Commands:

  • wsl -l: Lists installed Linux distributions.
  • wsl -d <distribution>: Starts the specified distribution.
  • wsl --set-default <distribution>: Sets the default distribution.
  • wsl --unregister <distribution>: Uninstalls a distribution.

Additional Tips

  • Update Linux distribution: Keep your distribution up-to-date with the latest packages and security patches.
  • Explore WSL features: Discover advanced features like file sharing, GUI applications, and performance optimizations.
  • Leverage WSL for development: Use WSL to build web applications, machine learning projects, and more.  

By following these steps, you can harness the power of Linux within your Windows environment.


Setting Up an NGS Data Analysis Environment on WSL Using Miniconda

This guide will walk you through setting up a dedicated environment for NGS data analysis on Windows Subsystem for Linux (WSL) using Miniconda. We’ll create a clean environment and install FastQC, BWA, Samtools, Picard, and Bcftools.

1. Install Miniconda

  • Download the Miniconda installer: Visit https://docs.conda.io/en/latest/miniconda.html to download the appropriate Miniconda installer for Linux.
  • Run the installer: Open a terminal and navigate to the download directory. Run the installer with `bash path/to/Miniconda3-latest-Linux-x86_64.sh`, replacing path/to/Miniconda3-latest-Linux-x86_64.sh with the actual path to your downloaded file.
  • Follow the on-screen instructions. It’s recommended to add Miniconda to your PATH.
  • Verify installation: Open a new terminal and type conda --version to check if Miniconda is installed correctly.

2. Create a New Environment

  • Create the environment: Use the following command to create a new environment named ngs_env: `conda create -n ngs_env python=3.8`. Replace python=3.8 with your desired Python version if needed.
  • Activate the environment: Activate the environment using `conda activate ngs_env`.

3. Install Required Packages

  • Add the Bioconda channels: To access bioinformatics packages, run `conda config --add channels bioconda` and then `conda config --add channels conda-forge`.
  • Install tools: Install the required tools using Conda: `conda install -c bioconda fastqc bwa samtools picard bcftools`.

4. Verify Installation

  • Check tool versions: Verify that the tools are installed correctly by checking their versions:
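
One way to run this check is sketched below. It assumes the ngs_env environment is active; the helper name check_tools is hypothetical. It reports whether each of the five tools can be found on the PATH.

```shell
# Hypothetical helper: report where each tool from the install step lives.
check_tools() {
    for tool in fastqc bwa samtools picard bcftools; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: $(command -v "$tool")"
        else
            echo "$tool: not found"
        fi
    done
}

check_tools
```

For the version numbers themselves, `fastqc --version`, `samtools --version` and `bcftools --version` print them directly. `bwa` has no version flag; it shows its version in the usage text it prints when run without arguments.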

Additional Considerations

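  • Java: Picard and FastQC are Java programs. The Bioconda packages normally install a Java runtime as a dependency; if a tool reports that Java is missing, make sure Java is installed and on your PATH.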
  • Reference Genomes: Download reference genomes (e.g., human genome) for BWA indexing and alignment.
  • Environment Management: Use conda list to see installed packages, conda remove to uninstall packages, and conda deactivate to deactivate the environment.

You now have a Miniconda environment with essential NGS tools installed on your WSL system. This setup provides a flexible and isolated environment for your NGS data analysis projects.

Note: The specific installation steps and package versions might change over time. Refer to the official documentation of Miniconda and the tools for the latest instructions.

1. Quality Control with FastQC

The first step in NGS data analysis is Quality Control. FastQC is used to assess the quality of your raw sequencing data.

Command:
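
A minimal sketch of this step, assuming paired-end reads in two gzip-compressed files. The helper name run_fastqc and all file names are placeholders, not part of FastQC.

```shell
# Hypothetical helper: run FastQC on one or more FASTQ files.
#   $1     output directory for the HTML and ZIP reports
#   $2...  FASTQ files (plain or gzip-compressed)
run_fastqc() {
    outdir=$1
    shift
    mkdir -p "$outdir"            # FastQC expects the directory to exist
    fastqc -o "$outdir" "$@"
}
```

Example call: `run_fastqc qc_reports sample_R1.fastq.gz sample_R2.fastq.gz`.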

This command generates HTML reports for each input FASTQ file, providing information about read length, base quality, adapter contamination, and other quality metrics.

2. Read Mapping with BWA

BWA (Burrows-Wheeler Aligner) is a popular choice for short-read mapping in NGS data analysis due to its speed, accuracy, and versatility. It employs the Burrows-Wheeler Transform (BWT) algorithm to efficiently index and search the reference genome.

Key reasons for using BWA:

  • Speed: BWA is known for its fast mapping speed, crucial for handling large NGS datasets.
  • Accuracy: It provides high mapping accuracy, essential for downstream analyses like variant calling.
  • Versatility: BWA can handle different read lengths and sequencing error rates. It offers multiple algorithms (BWA-MEM, BWA-backtrack) to suit different data types and analysis goals.
  • Efficiency: The BWT-based indexing allows for rapid search and alignment of short reads.
  • Widely used: BWA is a well-established tool with extensive documentation and community support.

2.1. Index the reference genome

Before mapping, BWA needs an index of the reference genome, which is supplied in FASTA format.

This index is essential for the efficient alignment of sequencing reads. Building it creates several auxiliary files with specific extensions (.amb, .ann, .bwt, .pac, .sa) that contain compressed representations of the genome sequence and other information needed for rapid search and alignment.

Command
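
The indexing call can be sketched as follows. The wrapper name index_reference and the file name reference.fasta are placeholders; the `bwa index` call inside it is the actual command.

```shell
# Hypothetical helper: build the BWA index for a reference FASTA.
# The index files (.amb, .ann, .bwt, .pac, .sa) are written next to the input.
index_reference() {
    bwa index "$1"
}
```

Example call: `index_reference reference.fasta`. Indexing is done once per reference; the index files are reused for every sample mapped against it.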

2.2. Map reads to the reference genome

This step aligns/maps the reads to the reference genome. This is one of the crucial steps in NGS data analysis.

BWA’s algorithms:
  • BWA-MEM: Generally preferred for Illumina reads of 70 bp or longer, where it is faster and more accurate than BWA-backtrack.
  • BWA-backtrack: Designed for short Illumina reads (up to about 100 bp) and mainly used for reads shorter than 70 bp.

Command
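
As a sketch of paired-end mapping with BWA-MEM: the helper name map_reads, the file names and the thread count of 4 are illustrative assumptions. The reference must already be indexed.

```shell
# Hypothetical helper: align paired-end reads with BWA-MEM.
#   $1 indexed reference FASTA   $2 forward reads (R1)
#   $3 reverse reads (R2)        $4 output SAM file
map_reads() {
    ref=$1 r1=$2 r2=$3 out=$4
    # -t sets the number of threads; 4 is an arbitrary choice.
    # BWA writes SAM to standard output, so redirect it to a file.
    bwa mem -t 4 "$ref" "$r1" "$r2" > "$out"
}
```

Example call: `map_reads reference.fasta sample_R1.fastq.gz sample_R2.fastq.gz aligned.sam`.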

3. Convert SAM to BAM

This step converts the text-based Sequence Alignment/Map (SAM) file to its compressed binary version (BAM) for efficient downstream analysis.

Command
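
One way to write the conversion, with sam_to_bam as a hypothetical wrapper name and the file names as placeholders:

```shell
# Hypothetical helper: convert a SAM file to BAM.
#   $1 input SAM   $2 output BAM
sam_to_bam() {
    # -b selects BAM output; -o names the output file.
    samtools view -b -o "$2" "$1"
}
```

Example call: `sam_to_bam aligned.sam aligned.bam`.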

4. Sort BAM file

This step sorts the alignments in the BAM file by genomic coordinate, which is required for indexing and expected by most downstream tools.

Command
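
A sketch of the sorting step. The wrapper name sort_bam and the file names are placeholders.

```shell
# Hypothetical helper: coordinate-sort a BAM file.
#   $1 input BAM   $2 sorted output BAM
sort_bam() {
    # samtools sort orders by coordinate unless told otherwise.
    samtools sort -o "$2" "$1"
}
```

Example call: `sort_bam aligned.bam aligned.sorted.bam`.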

4.1. Index the sorted BAM file

Command
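
Indexing takes only the sorted BAM file as input. In this sketch the wrapper name index_bam and the file name are assumptions.

```shell
# Hypothetical helper: index a coordinate-sorted BAM file.
# The index is written next to the input as <file>.bai.
index_bam() {
    samtools index "$1"
}
```

Example call: `index_bam aligned.sorted.bam`, which produces aligned.sorted.bam.bai.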

5. Quality Control Post-Mapping with Samtools

Samtools offers various tools to manipulate and analyze BAM files.

5.1. Flagstat: Provides summary statistics about aligned reads.

Command
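
A minimal sketch, with bam_flagstat as a hypothetical wrapper name and the file name as a placeholder:

```shell
# Hypothetical helper: print alignment summary statistics for a BAM file
# (total reads, mapped reads, properly paired reads, and so on).
bam_flagstat() {
    samtools flagstat "$1"
}
```

Example call: `bam_flagstat aligned.sorted.bam > aligned.flagstat.txt` keeps the report alongside the data.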

5.2. Index statistics: Provides information about mapped reads per chromosome.

Command
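
The sketch below keeps only the columns needed for a per-chromosome count. The helper name mapped_per_chromosome is hypothetical, and the BAM file is assumed to be sorted and indexed.

```shell
# Hypothetical helper: list mapped read counts per reference sequence.
# samtools idxstats prints four tab-separated columns:
#   name, sequence length, mapped reads, unmapped reads.
# cut keeps the name and the mapped-read count.
mapped_per_chromosome() {
    samtools idxstats "$1" | cut -f 1,3
}
```

Example call: `mapped_per_chromosome aligned.sorted.bam`. Running `samtools idxstats` on its own shows all four columns.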

6. Mark Duplicates with Picard

Picard's MarkDuplicates tool marks PCR and optical duplicates in the BAM file.

Command
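
A sketch of the duplicate-marking step, assuming the `picard` launcher installed by the Bioconda package. The helper name mark_duplicates and the file names are placeholders. The `I=`, `O=`, `M=` form is Picard's long-standing argument style; depending on the Picard version, the equivalent `-I`, `-O`, `-M` form may be preferred.

```shell
# Hypothetical helper: mark duplicate reads with Picard.
#   $1 sorted input BAM   $2 output BAM   $3 duplication metrics file
mark_duplicates() {
    picard MarkDuplicates I="$1" O="$2" M="$3"
}
```

Example call: `mark_duplicates aligned.sorted.bam aligned.dedup.bam dup_metrics.txt`.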

7. Variant Calling with Bcftools

Bcftools can be used for variant calling.

Call variants: This command piles up the aligned bases at each position and calls variants from them. Adjust the -d parameter to control the maximum read depth considered per position.

Command
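
The two Bcftools stages are usually chained with a pipe. This is a sketch under assumed file names, with call_variants as a hypothetical wrapper. The depth of 250 matches the Bcftools default and is shown only to make the -d option visible.

```shell
# Hypothetical helper: pile up bases and call variants with Bcftools.
#   $1 reference FASTA   $2 sorted, indexed BAM   $3 output VCF (.vcf.gz)
call_variants() {
    ref=$1 bam=$2 out=$3
    # mpileup: -O u streams uncompressed BCF to the next stage,
    #          -d caps the read depth per position, -f names the reference.
    # call:    -m uses the multiallelic caller, -v keeps variant sites only,
    #          -O z writes a compressed VCF to the file given by -o.
    bcftools mpileup -O u -d 250 -f "$ref" "$bam" |
        bcftools call -m -v -O z -o "$out"
}
```

Example call: `call_variants reference.fasta aligned.dedup.bam variants.vcf.gz`.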

8. Variant Annotation

Variant annotation is the process of adding biological information to genetic variants identified in a VCF (Variant Call Format) file. This information can include gene names, protein changes, functional impacts, population frequencies, and clinical significance. By annotating variants, researchers can gain insights into the potential consequences of these genetic variations and their relevance to diseases or phenotypes.

Tools for Variant Annotation

Several tools are available for variant annotation, each with its own strengths and focus:

Comprehensive Annotation Suites

  • ANNOVAR: Offers a wide range of annotation options, including gene-based, region-based, and functional annotations. It integrates with various databases.
  • SnpEff: Predicts the effects of variants on protein sequences, including amino acid changes and potential functional impacts.
  • VEP (Variant Effect Predictor): Provides comprehensive annotation, including gene and transcript information, protein changes, regulatory impacts, and population frequencies.

Other Tools

  • GATK Funcotator: Part of the GATK toolkit, it can annotate variants with various types of information.

Key Annotation Information

  • Gene information: The gene affected by the variant.
  • Transcript information: The transcript affected by the variant.
  • Protein changes: The amino acid changes caused by the variant.
  • Functional impact: The potential impact of the variant on protein function (e.g., missense, nonsense, synonymous).
  • Population frequency: The frequency of the variant in different populations.
  • Clinical significance: Information about the variant’s association with diseases or phenotypes.

This basic workflow provides a foundation for NGS data analysis. It’s essential to adapt the pipeline based on specific research questions, data types, and computational resources. Always refer to the documentation of the tools for detailed usage and options.

Note: This is a simplified overview, and real-world NGS analysis often involves more complex steps and quality control measures.
