Open wkumler opened 3 weeks ago
Conversion code:
import duckdb
conn = duckdb.connect('msdata.duckdb')
conn.execute("""
CREATE TABLE MS1 (
mz DOUBLE,
rt DOUBLE,
int DOUBLE
)
""")
for spectrum in mzml.MzML('210916_Poo_TruePooAmm-noIS-pos-Full_1.mzML'):
if spectrum['ms level'] == 1:
mz_vals = spectrum['m/z array']
int_vals = spectrum['intensity array']
rt_val = spectrum['scanList']['scan'][0]['scan start time']
df_scan = pd.DataFrame({'mz': mz_vals, 'int': int_vals, 'rt': [rt_val] * len(mz_vals)})
conn.execute("INSERT INTO MS1 SELECT * FROM df_scan")
conn.close()
Extraction code:
def get_chrom_duckdb(database):
conn = duckdb.connect(database)
query_data = conn.execute("SELECT * FROM MS1 WHERE ABS(mz-118.0865) <= 0.001").fetch_df()
conn.close()
return query_data
print(chrom_data)
chrom_data = get_chrom_duckdb("msdata.duckdb")
plt.plot(chrom_data["rt"], chrom_data["int"])
plt.show()
WHY IS IT SIDEWAYS
Adding to data frame is order dependent.
df_scan = pd.DataFrame({'mz': mz_vals, 'int': int_vals, 'rt': [rt_val] * len(mz_vals)})
does not add to mz, int, or rt based on name. It adds to the first, second, and third columns in the data frame.
DuckDB tasks:
For now it's fine to assume we have the memory required to do it all at once but eventually this will need to be refined