RLS: Roadmap - Githubissues

westurner commented 10 years ago

ENH: Linked Datasets (RDF)

This is very much a meta issue.
There are a number of bare links here; they are for documentation.

(original: https://github.com/pydata/pandas/issues/3402)

Use Case

So I:

retrieved some data
- from somewhere
- about a certain #topic
performed analysis
- with certain transformations and aggregations
- with certain versions of certain tools
- confirmed/rejected a [null] hypothesis

and I want to share my findings so that others can find, review, repeat, reproduce, and verify (confirm/reject) a given conclusion.

User Story

As a data analyst, I would like to share or publish Series, DataFrames, Panels, and Panel4Ds as structured, hierarchical, RDF linked data ("DataSet").

Status Quo: Pandas IO

How do I go from a [CSV] to a DataFrame to something shareable with a URL?

http://pandas.pydata.org/pandas-docs/dev/io.html

.

Series (1D)
- index
- data
- NumPy datatypes
DataFrame (2D)
- index
- column(s)
- NumPy datatypes
Panel (3D)
Panel4D (4D)

Read or parse a data format into a DataSet:

pandas.read_*
- read_clipboard
- read_csv
- read_excel
- read_fwf
- read_gbq
- read_hdf
- read_html
- read_json
- read_msgpack
- read_pickle
- read_sql
- read_stata
- read_table
pandas.HDFStore
- https://pandas.pydata.org/docs/dev/io.html#hdf5-pytables

Add metadata:

[ ] Add RDF metadata (RDFa, JSONLD)

Save or serialize a DataSet into a data format:

pandas.DataFrame.
- to_csv
- to_dict
- to_excel
- to_gbq
- to_html
- to_latex
- to_panel
- to_period
- to_records
- to_sparse
- to_sql
- to_stata
- to_string
- to_timestamp
- to_wide
[ ] to_ RDF
[ ] to_ CSVW
[ ] to_ HTML + RDFa
[ ] to_ JSONLD
- [ ] create a JSONLD @context
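
As a rough illustration of the to_ JSONLD item, the sketch below wraps a small table in a JSON-LD document with a hand-written @context. It uses plain dicts (comparable to the output of DataFrame.to_dict(orient="records")) and only the standard library. The function name to_jsonld_sketch, the example.com URIs, the term mapping, and the sample rows are all made up for illustration; none of this is an existing pandas API.

```python
import json


def to_jsonld_sketch(rows, context, dataset_uri):
    """Wrap tabular rows (a list of dicts) in a JSON-LD document.

    Hypothetical helper: pandas has no such method.
    """
    return {
        "@context": context,
        "@id": dataset_uri,
        "@graph": [
            # Give every row its own fragment identifier under the dataset URI.
            {"@id": f"{dataset_uri}#row={i}", **row}
            for i, row in enumerate(rows)
        ],
    }


# The @context maps column names to property URIs (assumed mapping).
context = {
    "name": "http://schema.org/name",
    "population": {
        "@id": "http://example.com/vocab#population",
        "@type": "http://www.w3.org/2001/XMLSchema#integer",
    },
}
rows = [
    {"name": "Springfield", "population": 30000},
    {"name": "Shelbyville", "population": 25000},
]
doc = to_jsonld_sketch(rows, context, "http://example.com/datasets/cities")
print(json.dumps(doc, indent=2))
```

Keeping the @context separate from the rows means the same column-to-URI mapping could be reused across several serialized tables.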

Share or publish a serialized DataSet with the internet:

Email Attachment (Table in a PDF)
- opendatahandbook.org
- project-open-data.github.io
FTP, SFTP, RSYNC, NFS
HTML web upload form with metadata form fields
CLI tool
Version Control: Git, Hg, Svn
- challenge: 'large' files ("binary blobs") in VCS systems
HTTP API: Object Storage (~LDP)
- GET/POST /container/filename.csv # [.json|.xml|.xls|.rdf|.html]
- challenge: indexing metadata from a separate document / named graph
- GET/POST to /container/filename.csv
Push to CKAN
Host DataSet metadata
- python -m SimpleHTTPServer 8088 (Python 2; on Python 3: python -m http.server 8088)
- e.g. http://datasets.schema-labs.appspot.com/about indexes http://schema.org/Dataset items

Implementation

What changes would be needed for Pandas core to support this workflow?

.meta schema
to_rdf for Series, DataFrames, Panels, and Panel4Ds
read_rdf for Series, DataFrames, Panels, and Panel4Ds
~@datastep process decorators
~DataSet
~DataCatalog of precomputed aggregations/views/slices.
Units support (.meta?)

.meta schema

It's easy enough to serialize a dict and a table to naive RDF.

For interoperability, it would be helpful to standardize with a common set of terms/symbols/structures/schema for describing the tabular, hierarchical data which pandas is designed to handle.

There is currently no standard method for storing columnar metadata within Pandas (e.g. in .meta['columns'][colname]['schema'], or as a JSON-LD @context).
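
One possible shape for such columnar metadata is sketched below, together with a conversion to a CSVW table description. The meta layout follows the .meta['columns'][colname]['schema'] idea above, but the exact keys, the file name, the example.com-style values, and the helper name meta_to_csvw are assumptions; only the CSVW keys (@context, url, tableSchema, columns, name, datatype, propertyUrl) come from the CSVW vocabulary.

```python
import json

# Hypothetical per-column metadata, keyed by column name.
meta = {
    "url": "filename.csv",
    "columns": {
        "date": {
            "schema": {
                "datatype": "date",
                "propertyUrl": "http://purl.org/dc/terms/date",
            }
        },
        "value": {"schema": {"datatype": "decimal"}},
    },
}


def meta_to_csvw(meta):
    """Build a CSVW table description (a dict) from the assumed meta layout."""
    columns = [
        {"name": name, **info["schema"]}
        for name, info in meta["columns"].items()
    ]
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": meta["url"],
        "tableSchema": {"columns": columns},
    }


csvw = meta_to_csvw(meta)
print(json.dumps(csvw, indent=2))
```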

Ontology Resources

http://www.w3.org/TR/rdf-schema/ (rdfs:)
http://www.w3.org/TR/owl-overview/ (owl:)
http://www.w3.org/TR/sparql11-query/#sparqlDefinition
http://lov.okfn.org
http://prefix.cc

CSV2RDF (csvw:)
http://www.w3.org/ns/csvw
https://github.com/w3c/csvw
https://w3c.github.io/csvw/

https://en.wikipedia.org/wiki/Comma-separated_values

https://tools.ietf.org/html/rfc4180

W3C PROV (prov:)
http://www.w3.org/TR/prov-primer/#intuitive-overview-of-prov
http://www.w3.org/TR/prov-o/
http://www.w3.org/2011/prov/wiki/ProvImplementations
- https://pypi.python.org/pypi/prov
- http://prov.readthedocs.org/en/latest/usage.html

schema.org (schema:)
http://schema.org
http://www.w3.org/wiki/WebSchemas
http://schema.rdfs.org/
https://schema.org/docs/full.html :
- schema:Dataset -- A body of structured information describing some topic(s) of interest.
  - [schema:Thing, schema:CreativeWork]
  - distribution -- A downloadable form of this dataset, at a specific location, in a specific format (DataDownload)
  - spatial, temporal
  - catalog -- A data catalog which contains a dataset (DataCatalog)
- schema:DataCatalog -- collection of Datasets
  - [schema:Thing, schema:CreativeWork]
  - dataset -- A dataset contained in a catalog. (Dataset)
- schema:DataDownload -- A dataset in downloadable form.
  - [schema:Thing, schema:CreativeWork]
  - contentSize
  - contentUrl
  - uploadDate
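
To make the three schema.org types above concrete, here is a minimal sketch of a Dataset with one DataDownload distribution, referenced from a DataCatalog, written as JSON-LD dicts. The URIs, names, and description are placeholders; the property names (distribution, contentUrl, encodingFormat, dataset) are schema.org terms.

```python
import json

dataset = {
    "@context": "http://schema.org",
    "@type": "Dataset",
    "@id": "http://example.com/datasets/#example",  # placeholder URI
    "name": "Example dataset",
    "description": "Placeholder description of the dataset.",
    "distribution": {
        "@type": "DataDownload",
        "contentUrl": "http://example.com/datasets/example.csv",
        "encodingFormat": "text/csv",
    },
}

catalog = {
    "@context": "http://schema.org",
    "@type": "DataCatalog",
    "name": "Example catalog",
    # Reference the dataset by its @id rather than embedding it again.
    "dataset": [{"@id": dataset["@id"]}],
}

print(json.dumps([dataset, catalog], indent=2))
```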

W3C RDF Data Cube (qb:)
http://www.w3.org/TR/vocab-data-cube/
http://www.w3.org/2011/gld/wiki/Data_Cube_Vocabulary#The_history_of_Data_Cube.2C_SDMX-RDF_and_SCOVO
http://www.w3.org/TR/vocab-data-cube/#vocab-reference :
- qb:DataSet -- a collection of observations, possibly organized into various slices, conforming to some common dimensional structure
- qb:Slice -- a subset of a DataSet defined by fixing a subset of the dimensional values.
- qb:Observation -- a single observation in the cube, may have one or more associated measured values.
- qb:dataSet -- data set of which this observation is a part.
- qb:ObservationGroup -- a, possibly arbitrary, group of observations.
- qb:observation -- an observation contained within this slice of the data set.
- qb:Slice -- a subset of a DataSet defined by fixing a subset of the dimensional values, component properties on the Slice.
- [Components, Properties, Dimensions, Attributes, Measures]
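
A minimal sketch of how table rows might map onto Data Cube terms: each row becomes one qb:Observation linked to its qb:DataSet, with one triple per dimension or measure. The function name, the example.com vocabulary used for component properties, and the sample rows are assumptions. A real serializer would also need to distinguish URIs from literals; here both are left as plain Python values.

```python
QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.com/vocab#"  # hypothetical vocabulary for components


def observations(rows, dataset_uri, dimensions, measures):
    """Yield (s, p, o) triples: one qb:Observation per row."""
    for i, row in enumerate(rows):
        obs = f"{dataset_uri}/obs/{i}"
        yield (obs, RDF_TYPE, QB + "Observation")
        yield (obs, QB + "dataSet", dataset_uri)
        for name in dimensions + measures:
            yield (obs, EX + name, row[name])


rows = [
    {"year": 2012, "region": "north", "count": 3},
    {"year": 2013, "region": "north", "count": 5},
]
cube = list(
    observations(rows, "http://example.com/datasets/counts",
                 dimensions=["year", "region"], measures=["count"])
)
for triple in cube:
    print(triple)
```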

to_rdf

http://pandas.pydata.org/pandas-docs/dev/io.html

Arguments:

[ ] output fmt
[ ] JSON-LD: compaction

.

[ ] Series.meta
[ ] Series.to_rdf()
[ ] DataFrame.meta
[ ] DataFrame.to_rdf()
[ ] Panel.meta
[ ] Panel.to_rdf()
[ ] Panel4D.meta
[ ] Panel4D.to_rdf()

read_rdf

http://pandas.pydata.org/pandas-docs/dev/remote_data.html

[ ] Series.read_rdf()
[ ] DataFrame.read_rdf()
[ ] Panel.read_rdf()
[ ] Panel4D.read_rdf()

Arguments to read_rdf would need to describe which dimensions of data to read into 1D/2D/3D/4D form.
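
For the 2D case, one way such arguments could look is sketched below: the caller passes a mapping from predicate URIs to column names, subjects become the index, and everything else is ignored. The function name read_rdf_sketch, its signature, and the sample triples are hypothetical.

```python
from collections import defaultdict


def read_rdf_sketch(triples, columns):
    """Pivot (s, p, o) triples into {subject: {column: value}}.

    `columns` maps predicate URIs to column names and thereby selects
    which predicates are read; other predicates are skipped.
    """
    table = defaultdict(dict)
    for s, p, o in triples:
        if p in columns:
            table[s][columns[p]] = o
    return dict(table)


triples = [
    ("http://example.com/obs/0", "http://example.com/vocab#year", 2012),
    ("http://example.com/obs/0", "http://example.com/vocab#count", 3),
    ("http://example.com/obs/0", "http://example.com/vocab#note", "ignored"),
    ("http://example.com/obs/1", "http://example.com/vocab#year", 2013),
]
table = read_rdf_sketch(
    triples,
    columns={
        "http://example.com/vocab#year": "year",
        "http://example.com/vocab#count": "count",
    },
)
print(table)
```

The resulting dict of dicts has the orientation that pandas.DataFrame.from_dict(table, orient="index") accepts, with missing cells becoming NaN.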

@datastep / PROV

[ ] Objective: Additive journal of transformations
[ ] Link to source script(s) URIs
[ ] Decorator for annotating data transformations with metadata.
[ ] Generate PROV metadata for data transformations
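
The decorator idea could be sketched roughly as follows: each decorated call appends one record to an append-only journal, using PROV-O term names for the activity type and timestamps. The decorator name datastep comes from this issue, but its signature, the JOURNAL list, the record layout, and the script URI are assumptions.

```python
import functools
from datetime import datetime, timezone

JOURNAL = []  # additive, append-only list of step records


def datastep(source=None):
    """Hypothetical decorator: record each call as a PROV-like activity."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = datetime.now(timezone.utc).isoformat()
            result = func(*args, **kwargs)
            JOURNAL.append({
                "@type": "prov:Activity",
                "label": func.__name__,
                "source": source,  # URI of the script that defines the step
                "prov:startedAtTime": started,
                "prov:endedAtTime": datetime.now(timezone.utc).isoformat(),
            })
            return result
        return wrapper
    return decorator


@datastep(source="http://example.com/scripts/analysis.py")
def drop_missing(rows):
    """Keep only rows without None values."""
    return [r for r in rows if None not in r.values()]


cleaned = drop_missing([{"a": 1}, {"a": None}])
print(cleaned)
print(JOURNAL[-1]["label"])
```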

Ten Simple Rules for Reproducible Computational Research (3, 4, 5, 7, 8, 10)

DataCatalog

A collection of Datasets.

[ ] DataCatalog = {that: df1, this: df1.groupby(...).apply(...), also_this: df2}
- 'this is an aggregation of that'
- 'this' has a URI
- 'that' has a URI
What if there is no metadata for df2?
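
A small sketch of the catalog relationships above, using prov:wasDerivedFrom to state that 'this' is an aggregation of 'that'. The catalog layout, the example.com URIs, and the helper name are assumptions; an entry without metadata (like df2) simply produces no triples.

```python
PROV_DERIVED = "http://www.w3.org/ns/prov#wasDerivedFrom"

# Hypothetical catalog: each key has an optional URI and an optional parent key.
catalog = {
    "that": {"uri": "http://example.com/datasets/#that", "derived_from": None},
    "this": {"uri": "http://example.com/datasets/#this", "derived_from": "that"},
    "also_this": {"uri": None, "derived_from": None},  # no metadata for df2
}


def derivation_triples(catalog):
    """Yield (derived, prov:wasDerivedFrom, source) for entries with URIs."""
    for entry in catalog.values():
        parent = entry["derived_from"]
        if parent is None:
            continue
        parent_uri = catalog[parent]["uri"]
        if entry["uri"] and parent_uri:
            yield (entry["uri"], PROV_DERIVED, parent_uri)


derivations = list(derivation_triples(catalog))
print(derivations)
```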

Units support
[ ] Series.meta
[ ] DataFrame.column.meta
NumPy [, PyTables]
http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html
https://pint.readthedocs.org/en/latest/
http://pythonhosted.org/quantities/

RDF Datatypes
http://en.wikipedia.org/wiki/ISO_8601
http://www.w3.org/TR/xmlschema-2/#decimal
http://schema.org/Date
http://schema.org/DateTime
http://schema.org/Float
http://schema.org/Quantity
https://github.com/RDFLib/rdflib
- from rdflib.namespace import XSD, RDF, RDFS
- from rdflib import URIRef, Literal
- https://github.com/RDFLib/rdflib-sqlalchemy (SQLAlchemy)
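
To show what typed values look like on the RDF side, the sketch below maps a few Python types to XSD datatypes and formats them as N-Triples literals. It uses only the standard library rather than rdflib; the function name typed_literal and the particular type mapping are assumptions, and edge cases such as float infinity or non-ASCII escaping are not handled.

```python
from datetime import date, datetime
from decimal import Decimal

XSD = "http://www.w3.org/2001/XMLSchema#"


def typed_literal(value):
    """Return an N-Triples literal for a Python value (sketch)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return f'"{str(value).lower()}"^^<{XSD}boolean>'
    if isinstance(value, int):
        return f'"{value}"^^<{XSD}integer>'
    if isinstance(value, float):
        return f'"{value}"^^<{XSD}double>'
    if isinstance(value, Decimal):
        return f'"{value}"^^<{XSD}decimal>'
    if isinstance(value, datetime):  # check datetime first: it subclasses date
        return f'"{value.isoformat()}"^^<{XSD}dateTime>'
    if isinstance(value, date):
        return f'"{value.isoformat()}"^^<{XSD}date>'
    # Fall back to a plain string literal with minimal escaping.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


for sample in (True, 42, 1.5, Decimal("2.50"), date(2013, 4, 19), 'a "b"'):
    print(typed_literal(sample))
```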

JSON-LD RDF
https://github.com/digitalbazaar/pyld (JSON-LD)
https://github.com/RDFLib/rdflib-jsonld (JSON-LD)

Linked Data Primer

Linked Data Abstractions

Graphs are represented as triples of (s,p,o)
Subject, Predicate, Object
Queries are patterns with ?references
- graph.triples((None, None, None))
- SELECT ?s ?p ?o WHERE { ?s ?p ?o }
subjects are linked to objects by predicates
- subjects and predicates are identified by URI 'keys'
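
The pattern idea can be sketched without any RDF library: a graph is a collection of (s, p, o) tuples, and a pattern is a tuple in which None acts as a wildcard, mirroring the graph.triples((None, None, None)) call above. The function name match and the sample triples are made up for illustration.

```python
def match(graph, pattern):
    """Yield triples whose terms equal every non-None term of the pattern."""
    for triple in graph:
        if all(want is None or want == got
               for want, got in zip(pattern, triple)):
            yield triple


graph = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:bob", "ex:knows", "ex:carol"),
]

# Equivalent in spirit to: SELECT ?s ?o WHERE { ?s ex:knows ?o }
knows = list(match(graph, (None, "ex:knows", None)))
print(knows)
```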

URIs and URLs

a URI is like a URL
usually, we expect URLs to be 'dereferenceable' HTTP URIs
- HTTP GET http://en.wikipedia.org/
a URI may start with a different URI prefix
- urn:
- uuid:
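
A short sketch of telling these cases apart with the standard library: urlparse exposes the scheme, and only http/https URIs are treated as candidates for an HTTP GET. The helper name is_http_uri is an assumption.

```python
import uuid
from urllib.parse import urlparse


def is_http_uri(uri):
    """True if the URI uses a scheme we would expect to dereference over HTTP."""
    return urlparse(uri).scheme in ("http", "https")


examples = [
    "http://en.wikipedia.org/",
    uuid.uuid4().urn,  # a random identifier of the form urn:uuid:...
]
for uri in examples:
    print(urlparse(uri).scheme, is_http_uri(uri))
```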

SQL and Linked Data

there exist standard mappings for whole SQL tablesets
- rdb2rdf
- similar to application scaffolding
- ACL support adds complexity
virtuoso supports SQL and RDF and SPARQL
- standard mappings
- virtuoso powers http://dbpedia.org/
- dbpedia.org has a high degree of centrality
  - http://lod-cloud.net/
rdflib-sqlalchemy maps RDF onto SQL tables
- fairly inefficiently, when compared to native triplestores

Named Graphs

Quads: (g, s, p, o)
g: sometimes called the 'context' of a triple
Metadata about GRAPH ?g
Multiple named graphs in one file: TriX, TriG
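
A minimal sketch of the quad model: each statement carries its graph name as the first element, and grouping by that name recovers one set of triples per named graph. The function name by_graph and the sample quads are made up for illustration.

```python
from collections import defaultdict


def by_graph(quads):
    """Group (g, s, p, o) quads into {graph_name: [(s, p, o), ...]}."""
    graphs = defaultdict(list)
    for g, s, p, o in quads:
        graphs[g].append((s, p, o))
    return dict(graphs)


quads = [
    ("ex:graph1", "ex:alice", "ex:knows", "ex:bob"),
    ("ex:graph2", "ex:bob", "ex:knows", "ex:carol"),
    ("ex:graph1", "ex:alice", "ex:name", "Alice"),
]
graphs = by_graph(quads)
print(graphs)
```

Metadata about a graph can then be stated as ordinary triples whose subject is the graph name itself.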

Linked Data Formats

[ ] NTriples
[ ] RDF/XML
- [ ] TriX
[ ] Turtle, N3
- [ ] TriG
[ ] JSON-LD

Choosing Schema

[ ] XSD, RDF, RDFS, DCTERMS
Which schema is most popular?
Which schema is a best fit for the data?
Which schema will search engines index for us?
What do the queries look like?
Years Later... What is OWL?
Why would we start with RDFS now?

Linked Data Process, Provenance, and Schema

DataSets have [implicit] URIs:

http://example.com/datasets/#<key>

Shared or published DataSets have URLs:

http://ckan.example.org/datasets/<key>

DataSets are about certain things:

e.g. URIs for #Tags, Categories, Taxonomy, Ontology

DataSets are derived from somewhere, somehow:

where and how was it downloaded? (digital sense)
how was it collected? (process control sense)

Datasets have structure:

Tabular, Hierarchical
1D, 2D, 3D, 4D
Graph-based
- Chains
- Flows
Schema

5 ★ Open Data

http://5stardata.info/
http://www.w3.org/TR/ld-glossary/#x5-star-linked-open-data

☆ Publish data on the Web in any format (e.g., PDF, JPEG) accompanied by an explicit Open License (expression of rights).
☆☆ Publish structured data on the Web in a machine-readable format (e.g., XML).
☆☆☆ Publish structured data on the Web in a documented, non-proprietary data format (e.g., CSV, KML).
☆☆☆☆ Publish structured data on the Web as RDF (e.g., Turtle, RDFa, JSON-LD, SPARQL).
☆☆☆☆☆ In your RDF, have the identifiers be links (URLs) to useful data sources.

https://en.wikipedia.org/wiki/Linked_Data

westurner commented 10 years ago

https://github.com/mhausenblas/omnidator

https://github.com/mhausenblas/schema-org-rdf/blob/master/tools/schema-gateway/schema_org_processor.py

westurner commented 8 years ago

Added:

[ ] to_ CSVW

westurner commented 7 years ago

Is tracking columnar metadata across merges easier with Series.meta (than with DataFrame.meta.columns[name].meta)?

westurner / pandasrdf

RLS: Roadmap #1
