Gene_Centric_Coding function fails due to character-formatted annotation data

kwdoyle commented 8 months ago

Hello,

I've been working my way through each step in the pipeline and have encountered an error when running STAARpipeline_Gene_Centric_Coding.r

The error occurs in the internal coding function within the Gene_Centric_Coding function. When the current annotation name is "aPC.LocalDiversity", the script attempts to perform a transformation of the data:

if (Annotation_name[k] == "aPC.LocalDiversity") {
                    Annotation.PHRED.2 <- -10 * log10(1 - 10^(-Annotation.PHRED/10))
                    Annotation.PHRED <- cbind(Annotation.PHRED,
                      Annotation.PHRED.2)
                    Anno.Int.PHRED.sub.name <- c(Anno.Int.PHRED.sub.name,
                      paste0(Annotation_name[k], "(-)"))
                  }

However, the data in Annotation.PHRED is of the character class, causing this error:

Error in -Annotation.PHRED : invalid argument to unary operator
Calls: Gene_Centric_Coding -> coding

aPC.LocalDiversity is not the only character annotation. Of those picked from the name catalog, the only other character values are 3 other annotation PCs:

/cadd_phred  # numeric
/linsight  # numeric
/fathmm_xf  # numeric
/apc_epigenetics_active  # character
/apc_epigenetics_repressed  # character
/apc_epigenetics_transcription  # numeric
/apc_conservation  # numeric
/apc_local_nucleotide_diversity  # character (causing this error)
/apc_mappability  # character
/apc_transcription_factor  # numeric
/apc_protein_function  # numeric

Would anyone know why these values would be written to (or read from) the GDS file as characters?

xihaoli commented 8 months ago

Hi @kwdoyle,

Thank you for letting me know. Could I ask what FAVOR database files did you use to annotate your genotype data?

Best, Xihao

kwdoyle commented 8 months ago

I'm using the full database of 160 annotations hosted on Harvard Dataverse. More specifically, just the database file for chromosome 5 for now.

Interestingly enough, the Annotate.R script seems to work correctly. If I read in the saved, merged annotation file "Anno_chr5_STAARpipeline.csv" with data.table::fread, all of the data indicated in my previous post as being character-formatted are read in as numeric.

So presumably the issue occurs somewhere in the gds2agds.R script?

It also might be of interest that I chose to annotate the gds using all 160 annotations (rather, I set my 'anno_colnum' variable as c(1, 8:190). I chose the same initial annotations as the default (1 and 8) and then selected all variables from 8 to the end. I'm not sure if this is relevant, however, as the file from Annotate.R seems fine

kwdoyle commented 8 months ago

Oh, while looking through the scripts, I think this part in gds2agds.R might be the problem:

FunctionalAnnotation <- read_csv(paste0(dir_anno,"chr",chr,"/",anno_file_name_1,chr,anno_file_name_2),
col_types=list(col_character(),col_double(),col_double(),col_double(),col_double(),
col_double(),col_double(),col_double(),col_double(),col_double(),
col_character(),col_character(),col_character(),col_double(),col_character(),
col_character(),col_character(),col_character(),col_character(),col_double(),
col_double(),col_character()))

The character values I'm seeing might be in the location of these hard-coded character columns..

xihaoli commented 8 months ago

Hi @kwdoyle,

Thank you so much for taking a closer look at the issue. Yes indeed, the current FAVORannotator script in the STAARpipeline-Tutorial works well for the FAVOR Essential Database hosted on Harvard Dataverse. For the full database of 160 annotations, it is recommended to adapt the col_types in gds2agds.R to reflect the correct column type for each annotation.

Best, Xihao

kwdoyle commented 8 months ago

Yes, this was indeed the issue. If I removed the col_types specification and let read_csv auto-assign the column classes, everything seems to work fine.

xihaoli commented 8 months ago

Thank you @kwdoyle! Feel free to let me know if you have any other questions, or you may close this issue.

xihaoli / STAARpipeline-Tutorial

Gene_Centric_Coding function fails due to character-formatted annotation data #36