Working with paired end data

Valentin-Bio commented 1 year ago

Hello, I have 93 metagenomic samples. I first ran the stage_one program as follows:

args_oap stage_one -I all_guys -o args -f fastq.gz -t 20

According to the "(optional) Single/Paired end files" section on the main GitHub page, for args_oap to treat files as paired-end data, the file names must end with _1 or _2 followed by the specified format (-f).

My paired-end reads have the suffixes _1.fastq.gz and _2.fastq.gz. Since I specified -f fastq.gz, each pair should be treated as two files belonging to one sample (paired-end data), but judging by the stdout messages, the program is treating them as separate samples:

[2023-04-30 09:59:47] INFO: Processing <all_guys/CL_FP.BAC4A_ATATCTCG-ACTAAGAT_L00M_1.fastq.gz> (1/186) ...
[2023-04-30 10:14:12] INFO: Processing <all_guys/CL_FP.BAC4A_ATATCTCG-ACTAAGAT_L00M_2.fastq.gz> (2/186) ...
[2023-04-30 10:28:41] INFO: Processing <all_guys/CL_FP.BAC4B_GCGCTCTA-GTCGGAGC_L00M_1.fastq.gz> (3/186) ...
[2023-04-30 10:36:09] INFO: Processing <all_guys/CL_FP.BAC4B_GCGCTCTA-GTCGGAGC_L00M_2.fastq.gz> (4/186) ...
[2023-04-30 10:43:40] INFO: Processing <all_guys/CL_FP.BAC4C_AACAGGTT-CTTGGTAT_L00M_1.fastq.gz> (5/186) ...
[2023-04-30 10:59:25] INFO: Processing <all_guys/CL_FP.BAC4C_AACAGGTT-CTTGGTAT_L00M_2.fastq.gz> (6/186) ...
[2023-04-30 11:15:12] INFO: Processing <all_guys/CL_FP.BAC4D_GGTGAACC-TCCAACGC_L00M_1.fastq.gz> (7/186) ...
[2023-04-30 11:19:22] INFO: Processing <all_guys/CL_FP.BAC4D_GGTGAACC-TCCAACGC_L00M_2.fastq.gz> (8/186) ...
[2023-04-30 11:23:28] INFO: Processing <all_guys/CL_FP.BAC4E_CAACAATG-CCGTGAAG_L00M_1.fastq.gz> (9/186) ...
[2023-04-30 11:30:16] INFO: Processing <all_guys/CL_FP.BAC4E_CAACAATG-CCGTGAAG_L00M_2.fastq.gz> (10/186) ...
[2023-04-30 11:37:00] INFO: Processing <all_guys/CL_FP.BAC4F_TGGTGGCA-TTACAGGA_L00M_1.fastq.gz> (11/186) ...
[2023-04-30 11:45:54] INFO: Processing <all_guys/CL_FP.BAC4F_TGGTGGCA-TTACAGGA_L00M_2.fastq.gz> (12/186) ...
[2023-04-30 11:54:57] INFO: Processing <all_guys/CL_FP.BAC4G_AGGCAGAG-GGCATTCT_L00M_1.fastq.gz> (13/186) ...
[2023-04-30 12:02:48] INFO: Processing <all_guys/CL_FP.BAC4G_AGGCAGAG-GGCATTCT_L00M_2.fastq.gz> (14/186) ...
[2023-04-30 12:10:44] INFO: Processing <all_guys/CL_FP.BAC4H_GAATGAGA-AATGCCTC_L00M_1.fastq.gz> (15/186) ...
[2023-04-30 12:17:59] INFO: Processing <all_guys/CL_FP.BAC4H_GAATGAGA-AATGCCTC_L00M_2.fastq.gz> (16/186) ...
[2023-04-30 12:25:16] INFO: Processing <all_guys/CL_FP.BAC4I_TGCGGCGT-TACCGAGG_L00M_1.fastq.gz> (17/186) ...
[2023-04-30 12:27:11] INFO: Processing <all_guys/CL_FP.BAC4I_TGCGGCGT-TACCGAGG_L00M_2.fastq.gz> (18/186) ...
[2023-04-30 12:29:07] INFO: Processing <all_guys/CL_FP.BAC4J_CATAATAC-CGTTAGAA_L00M_1.fastq.gz> (19/186) ...
[2023-04-30 12:29:07] INFO: Processing <all_guys/CL_FP.BAC4J_CATAATAC-CGTTAGAA_L00M_2.fastq.gz> (20/186) ...

First question:

Regarding this log: is the program considering 186 samples or 186 files?

Second question:

After running stage_two, I picked the normalized_16S.type.txt file for further analysis. Would it be a good idea to simply sum the normalized counts found in each paired-end file for the same sample?

e.g.

CL_FP.BAC4J_CATAATAC-CGTTAGAA_L00M_1.fastq.gz counts + CL_FP.BAC4J_CATAATAC-CGTTAGAA_L00M_2.fastq.gz counts

Thanks for your time :)

bests,

Valentín.

xinehc commented 1 year ago

Hi,

Regarding this log: is the program considering 186 samples or 186 files?

These 186 files will be merged into 93 samples after stage_two. You should see 93 columns in the final output. If not, then it might be a bug.
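A quick way to check the column count is to read only the header line of the output table. The snippet below is a minimal sketch, not part of args_oap: it assumes the table is tab-delimited and that the first column holds the row labels, and the function name count_sample_columns is made up for illustration. Adjust the delimiter if your file looks different.

```python
import csv
import io


def count_sample_columns(handle, delimiter="\t"):
    """Return the number of sample columns in a delimited table.

    Assumption: the first header field labels the rows, and every
    remaining header field is one sample.
    """
    header = next(csv.reader(handle, delimiter=delimiter))
    return len(header) - 1


# Tiny in-memory table standing in for a real output file (made-up values).
example = io.StringIO("type\tsampleA\tsampleB\nbeta-lactam\t0.1\t0.2\n")
print(count_sample_columns(example))  # 2
```

For a real file, open it with open(path, newline="") and pass the handle to the function; with 93 samples the result should be 93.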

After running stage_two, I picked the normalized_16S.type.txt file for further analysis. Would it be a good idea to simply sum the normalized counts found in each paired-end file for the same sample?

If the files are not merged automatically, you could first sum the unnormalized counts of the _1 and _2 files, then divide that sum by the sum of the 16S values of the _1 and _2 files in metadata.txt.
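The manual calculation could be sketched as follows. This is only an illustration: the dictionaries and numbers are made up, the function name merge_and_normalize is hypothetical, and how you load the unnormalized counts and the 16S values from your own files is left out because it depends on their layout.

```python
def merge_and_normalize(counts_1, counts_2, n16s_1, n16s_2):
    """Sum the unnormalized counts of the _1 and _2 files, then divide by the summed 16S.

    counts_1 / counts_2: dicts mapping an ARG type to its unnormalized count.
    n16s_1 / n16s_2: 16S values of the two files (taken from metadata.txt).
    """
    total_16s = n16s_1 + n16s_2
    if total_16s <= 0:
        raise ValueError("summed 16S must be positive")
    arg_types = sorted(set(counts_1) | set(counts_2))
    return {
        arg: (counts_1.get(arg, 0.0) + counts_2.get(arg, 0.0)) / total_16s
        for arg in arg_types
    }


# Made-up numbers for one sample.
counts_1 = {"beta-lactam": 12.0, "tetracycline": 3.0}
counts_2 = {"beta-lactam": 8.0}
print(merge_and_normalize(counts_1, counts_2, n16s_1=40.0, n16s_2=60.0))
```

Note that this is not the same as adding the two normalized values: with these numbers, beta-lactam gives 12/40 + 8/60 ≈ 0.43 when the normalized values are summed, but (12 + 8)/(40 + 60) = 0.2 when the counts are merged first.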

HTH, Xi

Valentin-Bio commented 1 year ago

Thanks, the problem was that some of the file names had an additional underscore before the last underscore.

e.g.

readA_1_1.fastq.gz readA_2_2.fastq.gz

Thanks! I renamed the files and everything is good now.
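For anyone hitting the same thing, a check like the one below can flag such names before running stage_one. It is a rough sketch that assumes pairing is decided purely by the text in front of the final _1/_2 suffix; it is not the actual args_oap logic, and find_unpaired is a hypothetical helper.

```python
import re
from collections import defaultdict


def find_unpaired(filenames, fmt="fastq.gz"):
    """Return file names that lack a mate sharing the same stem.

    The stem is whatever precedes the final _1 or _2 plus the format,
    so readA_1_1.fastq.gz has stem "readA_1" and readA_2_2.fastq.gz
    has stem "readA_2": the two never meet.
    """
    pattern = re.compile(r"^(.+)_([12])\." + re.escape(fmt) + r"$")
    mates = defaultdict(dict)
    unpaired = []
    for name in filenames:
        match = pattern.match(name)
        if match is None:
            unpaired.append(name)  # no _1/_2 suffix at all
            continue
        stem, mate = match.groups()
        mates[stem][mate] = name
    for found in mates.values():
        if set(found) != {"1", "2"}:
            unpaired.extend(found.values())
    return sorted(unpaired)


files = [
    "readA_1_1.fastq.gz", "readA_2_2.fastq.gz",
    "readB_1.fastq.gz", "readB_2.fastq.gz",
]
print(find_unpaired(files))  # ['readA_1_1.fastq.gz', 'readA_2_2.fastq.gz']
```

An empty list means every file has a mate with an identical stem.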

xinehc / args_oap

Working with paired end data #25