westurner / pandasrdf

pandasrdf integrates Pandas and RDF
BSD 3-Clause "New" or "Revised" License

RLS: Roadmap #1

Open westurner opened 10 years ago

westurner commented 10 years ago

ENH: Linked Datasets (RDF)

(original: https://github.com/pydata/pandas/issues/3402)

Use Case

So I:

and I want to share my findings so that others can find, review, repeat, reproduce, and verify (confirm/reject) a given conclusion.

User Story

As a data analyst, I would like to share or publish Series, DataFrames, Panels, and Panel4Ds as structured, hierarchical, RDF linked data ("DataSet").

Status Quo: Pandas IO

How do I go from a [CSV] to a DataFrame to something shareable with a URL?

http://pandas.pydata.org/pandas-docs/dev/io.html


Read or parse a data format into a DataSet:

Add metadata:

Save or serialize a DataSet into a data format:

Share or publish a serialized DataSet with the internet:
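
The four steps above (read, add metadata, serialize, share) can be sketched with the standard library alone. This is a minimal sketch, not a proposed Pandas API: the `read_dataset`, `add_metadata`, and `serialize` helpers, the `{"metadata": ..., "rows": ...}` layout, and the toy CSV are all assumptions made for illustration. With Pandas, the rows would come from something like `pandas.read_csv(...).to_dict(orient="records")`.

```python
import csv
import io
import json

# Toy input; stands in for a CSV file on disk.
CSV_TEXT = "name,value\na,1\nb,2\n"


def read_dataset(text):
    """Read: parse CSV text into a list of row dicts (values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def add_metadata(rows, **metadata):
    """Add metadata: wrap rows and metadata in one dict (hypothetical layout)."""
    return {"metadata": dict(metadata), "rows": rows}


def serialize(dataset):
    """Serialize: dump the DataSet to JSON text that can be published at a URL."""
    return json.dumps(dataset, indent=2, sort_keys=True)


dataset = add_metadata(read_dataset(CSV_TEXT), title="Toy example")
payload = serialize(dataset)
print(payload)
```

The last step, sharing, is whatever puts `payload` behind a URL (a static file host, a data catalog); it is left out here because it depends on the publishing target.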

What changes would be needed for Pandas core to support this workflow?

It's easy enough to serialize a dict and a table to naive RDF.
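
As a rough illustration of what "naive RDF" could look like, the sketch below emits one N-Triples statement per table cell. The URI layout (`row/<n>` subjects, `column/<name>` predicates) and the function names are assumptions for this example, and every value is written as a plain string literal with no datatype, which is exactly the kind of ad hoc choice a shared schema would replace.

```python
from urllib.parse import quote


def escape_literal(value):
    """Escape a value for use inside a double-quoted N-Triples literal."""
    text = str(value)
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def table_to_ntriples(rows, base_uri):
    """Return one N-Triples line per cell of a list-of-dicts table.

    Subjects are <base_uri>row/<index>; predicates are
    <base_uri>column/<name> (a hypothetical, made-up URI scheme).
    """
    triples = []
    for index, row in enumerate(rows):
        subject = f"<{base_uri}row/{index}>"
        for column, value in row.items():
            predicate = f"<{base_uri}column/{quote(column, safe='')}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


rows = [{"name": "a", "value": 1}, {"name": "b", "value": 2}]
triples = table_to_ntriples(rows, "http://example.com/datasets/demo/")
print("\n".join(triples))
```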

For interoperability, it would be helpful to standardize with a common set of terms/symbols/structures/schema for describing the tabular, hierarchical data which pandas is designed to handle.

There is currently no standard method for storing columnar metadata within Pandas (e.g. in .meta['columns'][colname]['schema'], or as a JSON-LD @context).
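
One way to picture the two options mentioned above is to derive a JSON-LD `@context` from a per-column metadata dict. This is only a sketch under stated assumptions: the `columns_meta` structure mirrors the `.meta['columns'][colname]['schema']` idea but is not an existing Pandas attribute, the helper names are hypothetical, and the schema.org property URIs are placeholders for whatever vocabulary is chosen.

```python
import json

# Hypothetical per-column metadata, keyed by column name.
columns_meta = {
    "name": {"schema": "http://schema.org/name"},
    "value": {"schema": "http://schema.org/value"},
}


def context_from_columns(columns_meta):
    """Map each column name to its schema URI as a JSON-LD term definition."""
    return {
        column: meta["schema"]
        for column, meta in columns_meta.items()
        if "schema" in meta
    }


def to_jsonld(rows, columns_meta):
    """Attach the derived @context to the rows, one JSON-LD node per row."""
    return {"@context": context_from_columns(columns_meta), "@graph": rows}


rows = [{"name": "a", "value": 1}, {"name": "b", "value": 2}]
document = to_jsonld(rows, columns_meta)
print(json.dumps(document, indent=2))
```

Columns without a `schema` entry are simply left out of the context, so a JSON-LD processor would ignore them rather than guess a meaning.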

Ontology Resources

https://en.wikipedia.org/wiki/Comma-separated_values

http://pandas.pydata.org/pandas-docs/dev/io.html

Arguments:


http://pandas.pydata.org/pandas-docs/dev/remote_data.html

Arguments to read_rdf would need to describe which dimensions of data to read into 1D/2D/3D/4D form.
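
To make the 2D case concrete, here is a minimal sketch of how such arguments might work. No `read_rdf` exists in Pandas; `read_rdf_2d` and its `columns` parameter are hypothetical, and the input is assumed to be already-parsed `(subject, predicate, object)` tuples rather than an RDF file.

```python
def read_rdf_2d(triples, columns=None):
    """Pivot (subject, predicate, object) triples into a 2D table.

    Each distinct subject becomes a row and each predicate a column.
    If `columns` is given, predicates not in it are skipped.
    Returns {subject: {predicate: object}}.
    """
    table = {}
    for subject, predicate, obj in triples:
        if columns is not None and predicate not in columns:
            continue
        # Later triples overwrite earlier ones for the same cell.
        table.setdefault(subject, {})[predicate] = obj
    return table


triples = [
    ("ex:r0", "ex:name", "a"),
    ("ex:r0", "ex:value", "1"),
    ("ex:r1", "ex:name", "b"),
]
table = read_rdf_2d(triples)
names_only = read_rdf_2d(triples, columns={"ex:name"})
print(table)
```

The resulting dict of dicts could be handed to `pandas.DataFrame.from_dict(table, orient="index")`. Higher-dimensional forms would need further arguments naming which predicates act as the extra axes.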

@datastep / PROV

Ten Simple Rules for Reproducible Computational Research (3, 4, 5, 7, 8, 10)

DataCatalog

A collection of Datasets.

Linked Data Abstractions

URIs and URLs

SQL and Linked Data

Named Graphs

Linked Data Formats

Choosing Schema

DataSets have [implicit] URIs:

http://example.com/datasets/#<key>

Shared or published DataSets have URLs:

http://ckan.example.org/datasets/<key>
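
The two patterns above differ only in how the key is attached: as a fragment for the implicit URI, as a path segment for the published URL. A small sketch, with hypothetical helper names and the example hosts from the text as defaults:

```python
from urllib.parse import quote


def dataset_uri(key, base="http://example.com/datasets/"):
    """Implicit URI: the key is a fragment identifier under the base."""
    return f"{base}#{quote(key, safe='')}"


def dataset_url(key, base="http://ckan.example.org/datasets/"):
    """Published URL: the key is a path segment under the catalog base."""
    return f"{base}{quote(key, safe='')}"


print(dataset_uri("my key"))
print(dataset_url("my key"))
```

Percent-encoding the key keeps spaces and reserved characters from changing the structure of the resulting URI.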

DataSets are about certain things:

e.g. URIs for #Tags, Categories, Taxonomy, Ontology

DataSets are derived from somewhere, somehow:

DataSets have structure:

http://5stardata.info/

http://www.w3.org/TR/ld-glossary/#x5-star-linked-open-data

☆ Publish data on the Web in any format (e.g., PDF, JPEG) accompanied by an explicit Open License (expression of rights).

☆☆ Publish structured data on the Web in a machine-readable format (e.g., XML).

☆☆☆ Publish structured data on the Web in a documented, non-proprietary data format (e.g., CSV, KML).

☆☆☆☆ Publish structured data on the Web as RDF (eg Turtle, RDFa, JSON-LD, SPARQL)

☆☆☆☆☆ In your RDF, have the identifiers be links (URLs) to useful data sources.

https://en.wikipedia.org/wiki/Linked_Data

westurner commented 10 years ago

https://github.com/mhausenblas/omnidator

https://github.com/mhausenblas/schema-org-rdf/blob/master/tools/schema-gateway/schema_org_processor.py

westurner commented 8 years ago

Added:

westurner commented 7 years ago