☆ Publish data on the Web in any format (e.g., PDF, JPEG) accompanied by an explicit Open License (expression of rights).
☆☆ Publish structured data on the Web in a machine-readable format (e.g., XML).
☆☆☆ Publish structured data on the Web in a documented, non-proprietary data format (e.g., CSV, KML).
☆☆☆☆ Publish structured data on the Web as RDF (e.g., Turtle, RDFa, JSON-LD, SPARQL).
☆☆☆☆☆ In your RDF, have the identifiers be links (URLs) to useful data sources.
ENH: Linked Datasets (RDF)
(original: https://github.com/pydata/pandas/issues/3402)
Use Case
So I:
and I want to share my findings so that others can find, review, repeat, reproduce, and verify (confirm/reject) a given conclusion.
User Story
As a data analyst, I would like to share or publish Series, DataFrames, Panels, and Panel4Ds as structured, hierarchical, RDF linked data ("DataSet").
Status Quo: Pandas IO
http://pandas.pydata.org/pandas-docs/dev/io.html
Read or parse a data format into a DataSet:
pandas.read_*
read_clipboard
read_csv
read_excel
read_fwf
read_gbq
read_hdf
read_html
read_json
read_msgpack
read_pickle
read_sql
read_stata
read_table
pandas.HDFStore
Add metadata:
Save or serialize a DataSet into a data format:
pandas.DataFrame.to_*
to_csv
to_dict
to_excel
to_gbq
to_html
to_latex
to_panel
to_period
to_records
to_sparse
to_sql
to_stata
to_string
to_timestamp
to_wide
Share or publish a serialized DataSet with the internet:
GET/POST /container/filename.csv  # [.json|.xml|.xls|.rdf|.html]
python -m SimpleHTTPServer 8088  # Python 2; on Python 3: python -m http.server 8088
Implementation
What changes would be needed for Pandas core to support this workflow?
.meta schema
to_rdf for Series, DataFrames, Panels, and Panel4Ds
read_rdf for Series, DataFrames, Panels, and Panel4Ds
~@datastep process decorators
~DataSet
~DataCatalog of precomputed aggregations/views/slices.
Units support (.meta?)
.meta schema
It's easy enough to serialize a dict and a table to naive RDF.
For interoperability, it would be helpful to standardize on a common set of terms/symbols/structures/schema for describing the tabular, hierarchical data which pandas is designed to handle.
There is currently no standard method for storing columnar metadata within Pandas (e.g. in .meta['columns'][colname]['schema'], or as a JSON-LD @context).
Ontology Resources
RDFS (rdfs:)
OWL (owl:)
CSV2RDF (csvw:)
https://en.wikipedia.org/wiki/Comma-separated_values
W3C PROV (prov:)
schema.org (schema:)
W3C RDF Data Cube (qb:)
to_rdf
http://pandas.pydata.org/pandas-docs/dev/io.html
Arguments: fmt
Series.meta
Series.to_rdf()
DataFrame.meta
DataFrame.to_rdf()
Panel.meta
Panel.to_rdf()
Panel4D.meta
Panel4D.to_rdf()
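As a rough illustration of what a to_rdf might emit, here is a minimal sketch that turns row records plus column metadata into a JSON-LD document using only the standard library. The function name to_jsonld, the base_uri argument, and the "#row=N" identifiers are assumptions made for this example, not an existing pandas API. The metadata layout follows the .meta['columns'][colname]['schema'] idea from this proposal.

```python
import json


def to_jsonld(records, meta, base_uri):
    """Build a JSON-LD document from row dicts and column metadata.

    Hypothetical helper: `meta` follows the proposed
    meta['columns'][colname]['schema'] layout.
    """
    # Map each column name to the term URI recorded in the metadata.
    context = {
        col: info["schema"]
        for col, info in meta.get("columns", {}).items()
        if "schema" in info
    }
    # Give every row an identifier derived from its position (an assumption).
    graph = [
        {"@id": "{}#row={}".format(base_uri, i), **row}
        for i, row in enumerate(records)
    ]
    return {"@context": context, "@graph": graph}


records = [
    {"name": "a", "height": 1.5},
    {"name": "b", "height": 2.0},
]
meta = {
    "columns": {
        "name": {"schema": "http://schema.org/name"},
        "height": {"schema": "http://schema.org/height"},
    }
}
doc = to_jsonld(records, meta, "http://example.org/container/filename.csv")
print(json.dumps(doc, indent=2))
```

With pandas, the records list could come from DataFrame.to_dict(orient="records"); the fmt argument mentioned above would then select between JSON-LD and other serializations.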
read_rdf
http://pandas.pydata.org/pandas-docs/dev/remote_data.html
Series.read_rdf()
DataFrame.read_rdf()
Panel.read_rdf()
Panel4D.read_rdf()
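A read_rdf has to decide how triples map onto rows and columns. The following is a minimal sketch for the 2D case, assuming each subject becomes a row and a caller-supplied mapping picks which predicates become columns. The name triples_to_records and the columns argument are hypothetical.

```python
from collections import defaultdict


def triples_to_records(triples, columns):
    """Pivot (subject, predicate, object) triples into one row dict per subject.

    `columns` maps predicate URIs to column names (a hypothetical argument
    standing in for "which dimensions to read").
    """
    rows = defaultdict(dict)
    for s, p, o in triples:
        if p in columns:  # skip predicates that were not requested
            rows[s][columns[p]] = o
    # Sort by subject so the row order is deterministic.
    return [dict(rows[s], subject=s) for s in sorted(rows)]


triples = [
    ("http://example.org/d#row=0", "http://schema.org/name", "a"),
    ("http://example.org/d#row=0", "http://schema.org/height", 1.5),
    ("http://example.org/d#row=1", "http://schema.org/name", "b"),
    ("http://example.org/d#row=1", "http://example.org/ignored", "x"),
]
columns = {
    "http://schema.org/name": "name",
    "http://schema.org/height": "height",
}
records = triples_to_records(triples, columns)
```

The resulting list of dicts can be passed to pandas.DataFrame(records); missing values (here, the height of the second row) would show up as NaN.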
Arguments to read_rdf would need to describe which dimensions of data to read into 1D/2D/3D/4D form.
@datastep / PROV
Ten Simple Rules for Reproducible Computational Research (3, 4, 5, 7, 8, 10)
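One possible shape for a @datastep decorator is sketched below: it wraps a processing function and appends a provenance record each time the step runs. The in-memory PROVENANCE_LOG and the record layout are assumptions for illustration; the keys borrow term names from W3C PROV (used, generated, startedAtTime, endedAtTime) but this is not a PROV serializer.

```python
import functools
from datetime import datetime, timezone

PROVENANCE_LOG = []  # hypothetical in-memory store of activity records


def datastep(func):
    """Record a PROV-style activity entry each time the wrapped step runs."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = datetime.now(timezone.utc)
        result = func(*args, **kwargs)
        PROVENANCE_LOG.append({
            "activity": func.__name__,
            "startedAtTime": started.isoformat(),
            "endedAtTime": datetime.now(timezone.utc).isoformat(),
            # Only the types of positional inputs and the output are recorded.
            "used": [type(a).__name__ for a in args],
            "generated": type(result).__name__,
        })
        return result

    return wrapper


@datastep
def drop_missing(rows):
    """Example step: keep only rows without None values."""
    return [r for r in rows if None not in r.values()]


clean = drop_missing([{"x": 1}, {"x": None}])
```

A fuller version would likely record identifiers (URIs) for the input and output DataSets rather than just their types, so that the log itself could be published as linked data.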
DataCatalog
A collection of Datasets.
DataCatalog = dict(that=df1, this=df1.groupby(key).apply(func), also_this=df2)  # key and func are placeholders
Units support
Series.meta
DataFrame.column.meta
RDF Datatypes
from rdflib.namespace import XSD, RDF, RDFS
from rdflib import URIRef, Literal
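To make the datatype question concrete, here is a small standard-library sketch that maps Python values to XSD datatype URIs and lexical forms. The table and the to_literal helper are illustrative assumptions; rdflib's Literal performs a similar conversion, and the URIs below are the same ones exposed by its XSD namespace.

```python
from datetime import date, datetime

XSD_NS = "http://www.w3.org/2001/XMLSchema#"

# Looked up by exact type, so bool is never mistaken for int.
PYTHON_TO_XSD = {
    bool: XSD_NS + "boolean",
    int: XSD_NS + "integer",
    float: XSD_NS + "double",
    str: XSD_NS + "string",
    datetime: XSD_NS + "dateTime",
    date: XSD_NS + "date",
}


def to_literal(value):
    """Return (lexical form, datatype URI) for a supported Python value."""
    datatype = PYTHON_TO_XSD.get(type(value))
    if datatype is None:
        raise TypeError("no XSD mapping for {!r}".format(type(value)))
    if isinstance(value, bool):
        lexical = "true" if value else "false"
    elif isinstance(value, (date, datetime)):
        lexical = value.isoformat()
    else:
        lexical = str(value)
    return lexical, datatype


print(to_literal(True))
print(to_literal(3))
print(to_literal(date(2013, 4, 19)))
```

NumPy dtypes (int64, float64, datetime64) would need their own entries; special float values such as infinity and NaN also need separate handling because their Python string forms differ from the XSD lexical forms.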
JSON-LD RDF
Linked Data Primer
Linked Data Abstractions
graph.triples((None, None, None))
SELECT ?s ?p ?o WHERE { ?s ?p ?o }
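Both forms above express the same idea: a triple pattern in which every position is a wildcard. The following plain-Python sketch shows that matching logic over a list of tuples, with None as the wildcard in the same way rdflib's graph.triples() uses it. The match function and the sample data are made up for illustration.

```python
def match(graph, pattern):
    """Yield triples matching an (s, p, o) pattern; None matches anything."""
    for triple in graph:
        if all(want is None or want == have
               for want, have in zip(pattern, triple)):
            yield triple


graph = [
    ("ex:df1", "rdf:type", "ex:DataSet"),
    ("ex:df1", "ex:derivedFrom", "ex:raw"),
    ("ex:df2", "rdf:type", "ex:DataSet"),
]

everything = list(match(graph, (None, None, None)))
datasets = [s for s, _, _ in match(graph, (None, "rdf:type", "ex:DataSet"))]
about_df1 = list(match(graph, ("ex:df1", None, None)))
```

Here everything holds all three triples, datasets holds the two subjects typed as ex:DataSet, and about_df1 holds the two triples whose subject is ex:df1.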
URIs and URLs
urn:
uuid:
SQL and Linked Data
Named Graphs
GRAPH ?g
Linked Data Formats
Choosing Schema
Linked Data Process, Provenance, and Schema
DataSets have [implicit] URIs:
Shared or published DataSets have URLs:
DataSets are about certain things:
DataSets are derived from somewhere, somehow:
DataSets have structure:
5 ★ Open Data
http://5stardata.info/ http://www.w3.org/TR/ld-glossary/#x5-star-linked-open-data
https://en.wikipedia.org/wiki/Linked_Data