Stephan Bui | Thyroid Cancer Gene Analysis

Thyroid Cancer Gene Analysis

Bioinformatics, Gene Expression Analysis

ROLES

Research

Data Cleaning

Analysis

TIMELINE

November 2024 - December 2024 (16 Hours)

TOOLS

RStudio, GDC Database
(Genomic Data Commons)

My partner and I were challenged to choose a specific mutation within the NRAS gene and analyze how it affects gene expression in cancer. We selected the chr1:g.114713908T>C mutation, a single-nucleotide substitution on chromosome 1 at position 114,713,908, where the reference thymine (T) is replaced by a cytosine (C).
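To make the notation concrete, here is a minimal sketch that splits a variant string like ours into its parts. It is written in Python purely for illustration and is not the R code used in the project. The `GenomicSubstitution` type and `parse_substitution` function are hypothetical names, and the pattern only handles simple single-nucleotide substitutions.

```python
import re
from typing import NamedTuple


class GenomicSubstitution(NamedTuple):
    """Parts of a simple genomic substitution (hypothetical helper type)."""
    chromosome: str
    position: int
    ref: str  # reference nucleotide
    alt: str  # nucleotide observed in the tumor


# Matches single-nucleotide substitutions such as "chr1:g.114713908T>C".
# Insertions, deletions, and other variant types are deliberately not handled.
_PATTERN = re.compile(r"^(chr[0-9XYM]+):g\.(\d+)([ACGT])>([ACGT])$")


def parse_substitution(notation: str) -> GenomicSubstitution:
    """Split a variant string into chromosome, position, reference, and alternate."""
    match = _PATTERN.match(notation)
    if match is None:
        raise ValueError(f"not a simple substitution: {notation!r}")
    chromosome, position, ref, alt = match.groups()
    return GenomicSubstitution(chromosome, int(position), ref, alt)


variant = parse_substitution("chr1:g.114713908T>C")
print(variant)
```

Reading the fields back gives chromosome `chr1`, position 114713908, reference `T`, and alternate `C`, matching the description above.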

Our study aimed to investigate how this mutation alters gene expression patterns in thyroid cancer, with the goal of uncovering its potential role in tumor progression and disrupted cellular signaling pathways.

To view the full paper, click here

Finding the Mutation

GDC Database

All data was found on the GDC Database from the TCGA-THCA project (The Cancer Genome Atlas Program).

Most of the cases in the data involved the follicular variant of papillary thyroid cancer. This type of cancer forms in the thyroid gland and shows features of both follicular and papillary growth. It’s also the most common variant, making up about 10–15% of all papillary thyroid cancer cases. We restricted the data to open-access cases that contained either masked somatic mutation data or gene expression quantification data.

The data was split into two groups: cases with no mutations in the NRAS gene, and cases that contained our chosen mutation (chr1:g.114713908T>C) in the NRAS gene. Sample sheets for these subsets were then downloaded directly from the GDC Database as CSV files.
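The grouping rule can be sketched in a few lines. This is an illustrative Python example, not the project's code: in practice the subsetting was done through the GDC Database itself. The case IDs, the second variant string, and the `assign_group` function are made up for the example, and the sketch assumes that cases carrying only some other NRAS mutation belong to neither group.

```python
from typing import Optional

TARGET_MUTATION = "chr1:g.114713908T>C"


def assign_group(nras_mutations: list[str]) -> Optional[str]:
    """Return the comparison group for a case, or None if it fits neither."""
    if not nras_mutations:
        return "group_1_no_nras_mutation"
    if TARGET_MUTATION in nras_mutations:
        return "group_2_target_mutation"
    # Assumption: a case with only a different NRAS mutation is excluded.
    return None


# Hypothetical cases: case ID -> NRAS mutations reported for that case.
cases = {
    "case-A": [],
    "case-B": ["chr1:g.114713908T>C"],
    "case-C": ["chr1:g.000000000A>G"],  # placeholder for some other NRAS variant
}

groups = {case_id: assign_group(muts) for case_id, muts in cases.items()}
print(groups)
```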

Analyzing the Mutation

RStudio

Before analyzing the mutation's effects, we did some further data wrangling by filtering out low-variance loci. We then used the DESeq package in R to compare gene expression between our two groups. After filtering, 40,670 genes remained.
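The idea behind a low-variance filter is simple: genes whose expression barely changes across samples carry little information for a group comparison, so they are dropped. The Python sketch below illustrates that idea only. It is not the R code from the project, the gene names and counts are toy values, and the `min_variance` cutoff is an arbitrary assumption, not the threshold we used.

```python
from statistics import variance


def filter_low_variance(
    counts: dict[str, list[float]], min_variance: float
) -> dict[str, list[float]]:
    """Keep only genes whose sample variance across samples meets the cutoff."""
    return {
        gene: values
        for gene, values in counts.items()
        if variance(values) >= min_variance
    }


# Toy expression values: gene -> one value per sample (hypothetical data).
toy_counts = {
    "GENE_A": [10, 12, 11, 9],
    "GENE_B": [5, 5, 5, 5],      # constant across samples, so variance is 0
    "GENE_C": [0, 40, 3, 90],
}

kept = filter_low_variance(toy_counts, min_variance=1.0)
print(sorted(kept))
```

With this cutoff, `GENE_B` is removed because it does not vary at all, while the other two genes are kept.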

After filtering:
Group 1: No Mutations in the NRAS Gene, 75 samples
Group 2: Chosen Mutation in the NRAS Gene, 20 samples

This figure shows a PCA (Principal Component Analysis) plot comparing samples with and without the NRAS mutation.

The samples show minimal clustering by group, indicating that overall gene expression patterns are similar between samples with and without the mutation.

This figure shows a volcano plot of the magnitude of change (fold change) and statistical significance (p-value) for each gene.

This plot shows that the most differentially expressed genes in the dataset were SYT12, IGHGP, DCSTAMP, IGHG2, and IGHG3.

After analysis, we found that 1,567 genes were significantly downregulated in Group 2 relative to Group 1, and 443 genes were upregulated.
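Counts like these come from labeling each gene by the direction and significance of its change. The Python sketch below shows one common way to do that, for illustration only: it is not the R code from the project. The cutoffs (`alpha=0.05` on the adjusted p-value, `lfc_cutoff=1.0` on the log2 fold change) are assumptions, not the thresholds we used, and the gene names and values are toy data.

```python
import math
from collections import Counter


def classify(log2_fold_change: float, adjusted_p: float,
             lfc_cutoff: float = 1.0, alpha: float = 0.05) -> str:
    """Label a gene as 'up', 'down', or 'not significant' (assumed cutoffs)."""
    if adjusted_p >= alpha:
        return "not significant"
    if log2_fold_change >= lfc_cutoff:
        return "up"
    if log2_fold_change <= -lfc_cutoff:
        return "down"
    return "not significant"


def volcano_point(log2_fold_change: float, p_value: float) -> tuple[float, float]:
    """Coordinates of a gene on a volcano plot: (log2 fold change, -log10 p)."""
    return (log2_fold_change, -math.log10(p_value))


# Toy results: (gene, log2 fold change of Group 2 vs Group 1, adjusted p-value).
toy_results = [
    ("GENE_A", 2.3, 0.001),
    ("GENE_B", -1.8, 0.01),
    ("GENE_C", -3.1, 0.2),     # large change, but not significant
    ("GENE_D", 0.4, 0.0001),   # significant, but the change is small
    ("GENE_E", -2.0, 0.03),
]

summary = Counter(classify(lfc, padj) for _, lfc, padj in toy_results)
print(summary)
```

A negative log2 fold change here means lower expression in Group 2 than in Group 1, which is the sense in which genes are called downregulated above.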

Reflection

The main challenge of this study was bringing together everything I had learned about RStudio and applying it in a single workflow.

Throughout the process, I found myself constantly troubleshooting, reflecting on my code, and refining each step to make sure everything ran smoothly. It was a test of both my technical skills and my ability to think critically and adapt as I worked through the project.