You may have heard part of this story, in which Marilyn Monroe said to Einstein: “Would it not be wonderful if we had a child with your brains and my beauty?” Einstein replied promptly: “Yes, but imagine a child with my beauty and your brains!”
The rest of the story is not well-publicized (for obvious reasons). Monroe and Einstein met secretly and produced not one but two children. One of them got Einstein’s brain and Monroe’s beauty, whereas the other one inherited the opposite characteristics.
Both kids had an even bigger problem to deal with. Their parents were famous, and therefore they could not live under their real names. Instead, they chose the nicknames “Bioconductor” and “tidyverse”.
Let me explain where I am going with all this. My students from biology love tidyverse and especially dplyr. The functions are clean and easy to learn. Bioconductor, on the other hand, is ugly as hell, but that is what they are stuck with to analyze NGS data.
Switching between the libraries is not easy, because they prefer different data formats. Here again the tidyverse data syntax is logical, whereas Bioconductor seems to be designed by Monroe’s brain and Einstein’s beauty.
The following commands will help you switch quickly between the two formats. Let us create a toy data frame of gene expression data; in real life, you will probably load your Kallisto or Salmon counts as data frames instead.
gene = c('gene1', 'gene2', 'gene3', 'gene4', 'gene5', 'gene6', 'gene7', 'gene8', 'gene9', 'gene10')
heart1 = c(10, 3, 4, 5, 8, 9, 1, 2, 4, 5)
kidney1 = c(3, 4, 5, 8, 9, 1, 2, 4, 5, 10)
brain1 = c(4, 5, 8, 9, 1, 2, 4, 5, 3, 2)
heart2 = c(2, 5, 1, 9, 1, 2, 4, 5, 1, 12)
kidney2 = c(10, 3, 4, 5, 1, 2, 4, 5, 4, 9)
brain2 = c(8, 2, 7, 2, 1, 2, 4, 5, 3, 2)
expt = data.frame(gene, heart1, kidney1, brain1, heart2, kidney2, brain2)
Going from Tidyverse Style to Bioconductor Style
Tidyverse likes the above style, whereas Bioconductor wants a numeric matrix with the gene names as the names of the rows.
library(tidyverse)  # provides %>%, select() and rownames_to_column()
expt_bioc = expt %>% select(-gene) %>% as.matrix
row.names(expt_bioc) = expt$gene
expt_bioc
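Why drop the gene column before calling as.matrix? Here is a minimal sketch, assuming a hypothetical three-gene version of the table above. A matrix can hold only one data type, so leaving the gene names in turns every count into text.

```r
library(dplyr)

# Hypothetical, shortened version of the expt data frame
expt_small <- data.frame(
  gene    = c("gene1", "gene2", "gene3"),
  heart1  = c(10, 3, 4),
  kidney1 = c(3, 4, 5)
)

# Keeping the gene column forces every cell to character
bad <- as.matrix(expt_small)
is.character(bad)

# Dropping it first keeps the counts numeric; gene names become row names
good <- expt_small %>% select(-gene) %>% as.matrix()
rownames(good) <- expt_small$gene
is.numeric(good)
```

The first check should come back TRUE for character and the second TRUE for numeric, which is why the gene names have to live in the row names rather than in a column.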
Going back from Bioconductor Style to Tidyverse Style
If you stick to the Bioconductor format and apply tidyverse functions, your gene names will disappear. Therefore, you need to move them into a column first.
expt=expt_bioc %>% as.data.frame %>% rownames_to_column("gene")
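If you want to convince yourself that nothing gets lost on the way there and back, a quick round-trip check may help. This is only a sketch on a hypothetical three-gene table; the names expt_small, small_bioc and small_back are made up for the illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical, shortened version of the expt data frame
expt_small <- data.frame(
  gene    = c("gene1", "gene2", "gene3"),
  heart1  = c(10, 3, 4),
  kidney1 = c(3, 4, 5)
)

# Tidyverse style -> Bioconductor style (genes as row names)
small_bioc <- expt_small %>% select(-gene) %>% as.matrix()
rownames(small_bioc) <- expt_small$gene

# Bioconductor style -> tidyverse style (genes back in a column)
small_back <- small_bioc %>% as.data.frame() %>% rownames_to_column("gene")

# Compare the gene names and one count column with the originals
identical(small_back$gene, as.character(expt_small$gene))
identical(small_back$heart1, expt_small$heart1)
```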
Adding Row Number as Another Column
There are times you may also want to add the row number as a column. That task is simple, because you can apply “rownames_to_column” again.
expt=expt %>% rownames_to_column("id")
Please do not tell me that tidyverse (dplyr) has another function to accomplish the latter task. I am trying to memorize the fewest functions I can to survive the R world.
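One detail about the id column that is easy to miss: row names are stored as text, so the new column comes out as character rather than as a number. The sketch below, again on a hypothetical small table, shows the type and converts it with base R's as.integer, so there is no extra tidyverse function to memorize.

```r
library(tibble)

# Hypothetical, shortened version of the expt data frame
expt_small <- data.frame(
  gene   = c("gene1", "gene2", "gene3"),
  heart1 = c(10, 3, 4)
)

expt_small <- rownames_to_column(expt_small, "id")
class(expt_small$id)  # "character"

# Convert to integer if you plan to sort or filter by row number
expt_small$id <- as.integer(expt_small$id)
class(expt_small$id)  # "integer"
```

Without the conversion, sorting by id would typically put "10" before "2", because text is compared character by character.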