wkumler / mzsql

A repository of data and code showing the efficiency of databases relative to existing mass-spectrometry database formats
0 stars 0 forks source link

DuckDB things #6

Open wkumler opened 3 weeks ago

wkumler commented 3 weeks ago

DuckDB tasks:

For now it's fine to assume we have the memory required to do it all at once but eventually this will need to be refined

wkumler commented 3 weeks ago

Conversion code:

import duckdb
conn = duckdb.connect('msdata.duckdb')
conn.execute("""
    CREATE TABLE MS1 (
        mz DOUBLE,
        rt DOUBLE,
        int DOUBLE
    )
""")
for spectrum in mzml.MzML('210916_Poo_TruePooAmm-noIS-pos-Full_1.mzML'):
    if spectrum['ms level'] == 1:
        mz_vals = spectrum['m/z array']
        int_vals = spectrum['intensity array']
        rt_val = spectrum['scanList']['scan'][0]['scan start time']
        df_scan = pd.DataFrame({'mz': mz_vals, 'int': int_vals, 'rt': [rt_val] * len(mz_vals)})
        conn.execute("INSERT INTO MS1 SELECT * FROM df_scan")
conn.close()
wkumler commented 3 weeks ago

Extraction code:

def get_chrom_duckdb(database):
    conn = duckdb.connect(database)
    query_data = conn.execute("SELECT * FROM MS1 WHERE ABS(mz-118.0865) <= 0.001").fetch_df()
    conn.close()
    return query_data
print(chrom_data)
chrom_data = get_chrom_duckdb("msdata.duckdb")

plt.plot(chrom_data["rt"], chrom_data["int"])
plt.show()

WHY IS IT SIDEWAYS

SamLaRue commented 3 weeks ago

Adding to data frame is order dependent.

df_scan = pd.DataFrame({'mz': mz_vals, 'int': int_vals, 'rt': [rt_val] * len(mz_vals)})

does not add to mz, int, or rt based on name. It adds to the first, second, and third columns in the data frame.