withastro / roadmap

Ideas, suggestions, and formal RFC proposals for the Astro project.

Referencing data from content collections #530

Closed bholmesdev closed 1 year ago

bholmesdev commented 1 year ago


Summary

Introduce a standard to store data separately from your content (ex. JSON files), with a way to "reference" this data from existing content collections.

Background & Motivation

Content collections are restricted to supporting .md, .mdx, and .mdoc files. This is limiting for other forms of data you may need to store, namely raw data formats like JSON.

Taking a blog post as the example, there will likely be author information that's reused across multiple blog posts. To standardize updates when, say, updating an author's profile picture, it's best to store authors in a separate data entry, with an API to reference this data from any blog post by ID.

The content collections API was built generically to support this future, choosing format-agnostic naming like data instead of frontmatter and body instead of rawContent. Because of this, expanding support to new data formats without API changes is a natural progression.

Use cases

We have a few use cases in mind considering data collections and data references. We expect this list to grow through the RFC discussion and learning from our community!

Goals

Non-goals

bholmesdev commented 1 year ago

We've discussed some initial examples of how data collections could work, including querying and referencing. This was informed by the work @tony-sull and I did on the astro.build site.

Ex: Creating a collection of JSON

Say you have a collection of blog post authors you would like to store as JSON. You can create a new collection under src/data/ like so:

src/data/
  authors/
    ben.json
    fred.json
    matthew.json

This collection can be configured with a schema like any other content collection. To flag the collection as data-specific, we may expose a new defineDataCollection() helper:

// src/content/config.ts
import { defineDataCollection, z } from 'astro:content';

const authors = defineDataCollection({
    schema: z.object({
        name: z.string(),
        twitter: z.string().url(),
    })
});

export const collections = { authors };

It can also be queried like any other collection, this example using getDataCollection('authors'):

---
import { getDataCollection } from 'astro:content';
const authors = await getDataCollection('authors');
---
<ul>
{authors.map(author => (
    <li>
        <a href={author.data.twitter}>{author.data.name}</a>
    </li>
))}
</ul>

Return type

Data collections will return a subset of fields exposed by content collections:

type DataCollectionEntry<C> = {
  id: string;
  data: object;
  collection: C;
}

This omits a few key fields:

Ex: Referencing data collections

Data collections could be referenced from existing content collection schemas. One example may be a reference() function (see @tony-sull's early experiment) to reference data collection entries by slug from your frontmatter.

This example allows you to list all blog post authors from each blog post:

// src/content/config.ts
import { defineCollection, defineDataCollection, image, reference, z } from 'astro:content';

const blog = defineCollection({
  schema: z.object({
    title: z.string(),
    authors: z.array(reference("authors")),
  })
});

const authors = defineDataCollection({
  schema: z.object({
    name: z.string(),
    avatar: image(),
  })
})

export const collections = { blog, authors };

Then, authors can be referenced by slug from each blog entry's frontmatter. This should validate each slug and raise a readable error for invalid authors:

---
title: Astro 2.0 launched
authors:
- fred-schott
- ben-holmes
---
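The validation step described above could look roughly like the following. This is a minimal sketch in plain TypeScript with no Astro or Zod dependency; the `knownAuthors` set, the `validateReferences` helper, and the error wording are all hypothetical and only illustrate the idea of checking each slug and reporting bad ones readably.

```typescript
// Hypothetical stand-in for the ids discovered in the authors collection.
const knownAuthors = new Set(["fred-schott", "ben-holmes"]);

// Check every referenced slug and collect the unknown ones, so the error
// lists all problems at once instead of failing on the first.
function validateReferences(
  collection: string,
  slugs: string[],
  known: Set<string>,
): string[] {
  const missing = slugs.filter((slug) => !known.has(slug));
  if (missing.length > 0) {
    throw new Error(
      `Invalid reference to "${collection}": ${missing.join(", ")}. ` +
        `Expected one of: ${Array.from(known).join(", ")}.`,
    );
  }
  return slugs;
}

// Usage: the frontmatter above passes, a misspelled slug would throw.
const validSlugs = validateReferences(
  "authors",
  ["fred-schott", "ben-holmes"],
  knownAuthors,
);
```

Collecting every missing slug before throwing is a design choice made here for readability of the error; failing on the first bad slug would work too.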

FredKSchott commented 1 year ago

How do you differentiate between 1:1 and 1:N? Is relation always ["tony-sull"] and never "tony-sull"? It could be enough to implement 1:1 by doing relation("author").length(5) but then you're always stuck with an array on both the frontmatter and the query side of things.

We should try to support 1:1 relations if we can, or if we can't make sure that's mentioned in the RFC.

This may even make sense to add as an explicit goal of the stage 2 proposal, since I'd argue the query DX hit of always needing to unwrap an array for 1:1 relation isn't acceptable.

Other initial thoughts:

FredKSchott commented 1 year ago
  • User-facing APIs to introduce new data collection formats like YAML or TOML. We recognize the value of community plugins to introduce new formats, and we will experiment with a pluggable API internally. Still, a finalized user-facing API will be considered out-of-scope.

Another reason to avoid this for now: if the whole reference system works by referencing an id/slug, then having a single large CSV doesn't guarantee a column as id/slug. We'd need some additional config to define which column is the primary key.
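To make the primary key concern concrete, here is a rough sketch in plain TypeScript of the extra step a single-file format would need. The `indexByPrimaryKey` helper and its `primaryKey` parameter are hypothetical; nothing like this is proposed in the RFC.

```typescript
type Row = Record<string, string>;

// Hypothetical: key the rows of a single-file collection by a configured
// column, rejecting rows that cannot serve as reference targets.
function indexByPrimaryKey(rows: Row[], primaryKey: string): Map<string, Row> {
  const entries = new Map<string, Row>();
  for (const row of rows) {
    const id = row[primaryKey];
    if (id === undefined || id === "") {
      throw new Error(`Row is missing the primary key column "${primaryKey}"`);
    }
    if (entries.has(id)) {
      throw new Error(`Duplicate id "${id}" in column "${primaryKey}"`);
    }
    entries.set(id, row);
  }
  return entries;
}

// Usage: rows as they might be parsed from a CSV, keyed by "slug".
const authorRows = indexByPrimaryKey(
  [
    { slug: "ben-holmes", name: "Ben" },
    { slug: "fred-schott", name: "Fred" },
  ],
  "slug",
);
```

With one file per entry, the file name already plays this role, which is why the multi-file layout needs no such config.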

bholmesdev commented 1 year ago

@FredKSchott Thanks for the suggestions! Think I agree with all of these. Thoughts on the API design:

  1. That content config is valid, just writing the collections in-line instead of creating variables to export later. Copied from Tony's stage 1 proposal. Refactoring to the docs recommendations for readability.
  2. Agreed that ref or reference are better names. I lean towards reference to avoid colliding with state management concepts from Vue et al.
  3. I agree 1-1 vs 1-many should be a standalone goal! I'll admit I was wondering this too but left it out in the example. Playing with a few ideas, I'm liking this early design:
...
// Reference a single author by id (default)
author: reference('authors'),
// Reference multiple authors in a list of IDs
authors: reference('authors', { relation: 'many' }),
  4. The more I scope this RFC, the more I really like a separate src/data/ directory. This lets us play with ideas like removing the render() and slug conventions without much effort.

bholmesdev commented 1 year ago

relation("authors").default(["tony-sull"]) That default took a second for me to parse, but I think that makes sense (vs. not supporting it at all). How much of the Zod API does relation() implement? Hopefully all of it?

Looking into this, I don't think we can support default() and optional() chaining in this way. Under the hood, reference() would transform the input ID to the actual data that ID points to:

export function reference(id) {
  const dataModule = import(resolveIdToDataImport(id));
  return z.object(dataModule);
}

This means helpers like .transform() and .refine() can work as expected, in case you want to massage data further (see the image().refine(...) example for our experimental assets). But since data is already resolved, default in particular wouldn't work (optional() might). I think each of these should be parameters instead, if supported:

...
authors: z.reference('authors', { default: 'ben-holmes' });

tony-sull commented 1 year ago

I'd definitely prefer the z.array(reference()) syntax if there aren't too many tradeoffs

For me I can't think of many uses for .default() or .transform() inside of z.array(), if that's the main tradeoff. I might want to default the array itself, but I'm not sure if .default() would ever run on an individual array item. For .transform(), would there be any performance improvement trying to transform each array item individually vs. transforming the full array once it resolves?

bholmesdev commented 1 year ago

Thanks @tony-sull, I agree that's intuitive! I'm wrestling with whether reference() should A) transform IDs to your data directly with a Zod transform, or B) avoid the transform and return some flag to tell Astro "hey! Post-process this Zod object key please!". These are the tradeoffs in what Zod extension functions we could support:

// Solution a)
author: reference('authors').transform(data => ...), // ✅ works
author: reference('authors').refine(data => ...), // ✅ works
author: reference('authors').default('ben-holmes') // ❌ Doesn't work. Data already resolved!
authors: z.array(reference('authors')) // ❌ Doesn't work. Data already resolved!

// Solution b)
author: reference('authors').transform(data => ...), // ❌ Doesn't work. Data not resolved yet!
author: reference('authors').refine(data => ...), // ❌ Doesn't work. Data not resolved yet!
author: reference('authors').default('ben-holmes') // ✅ works
authors: z.array(reference('authors')) // ✅ works

If we want to have Zod both ways, we need to add configuration options for the functions we don't support.

Ex. if we support .transform() and .refine(), we'd need the following for array and default:

authors: reference('authors', { array: true, default: 'tony-sull' }),

From what I've seen, I expect users to lean on array and default more than transform and refine. So I'm starting to agree that is the better way to go 👍

bholmesdev commented 1 year ago

@tony-sull Just marked my comment above as outdated because... I'm wrong! Zod transforms run separately, so you can totally do z.array(...) around a transformer and have it still work. Now I'm 100% on-board with your suggestion 👍
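A toy model of why this composes, written in plain TypeScript rather than Zod. The `Schema` type and the `reference` and `array` helpers below are simplified stand-ins invented for illustration (they are not Zod's or Astro's API): the point is only that an array wrapper can run the inner per-item transform on each element.

```typescript
// A tiny stand-in for a Zod-like schema: parse input, return transformed output.
type Schema<In, Out> = { parse(input: In): Out };

type Author = { id: string; name: string };

// Hypothetical in-memory data collection.
const authorsById: Record<string, Author> = {
  "ben-holmes": { id: "ben-holmes", name: "Ben" },
  "tony-sull": { id: "tony-sull", name: "Tony" },
};

// reference(): validates one id and transforms it into the entry it points to.
function reference(store: Record<string, Author>): Schema<string, Author> {
  return {
    parse(id) {
      const entry = store[id];
      if (!entry) throw new Error(`Unknown author id "${id}"`);
      return entry;
    },
  };
}

// array(): runs the inner schema, transform included, on each item.
function array<In, Out>(item: Schema<In, Out>): Schema<In[], Out[]> {
  return { parse: (inputs) => inputs.map((input) => item.parse(input)) };
}

// Usage: ids in, resolved entries out.
const authorsField = array(reference(authorsById));
const resolvedAuthors = authorsField.parse(["ben-holmes", "tony-sull"]);
```

Because the transform belongs to the item schema, the wrapper never needs to know the data was resolved, which matches the observation that transforms run separately.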

tony-sull commented 1 year ago

@bholmesdev Excellent! I wasn't actually sure if that setup would work, glad that does the trick! 🚀

bholmesdev commented 1 year ago

Discussion on single-file data collections vs. multi-file

There have been a few mentions of supporting a single file to store all data collection entries as an array, instead of splitting up entries per-file as we do for content collections today. This would mean, say, a single src/content/authors.json file instead of a few src/content/authors/[name].json files.

Investigating this, I think it's best to stick with multiple files instead. Reasons:

The reasons in favor of [collection].json:

tony-sull commented 1 year ago

I'd definitely lean towards multiple files for data collections, at least for the use cases I can think of. CSVs are an interesting one, but I could even see wanting multiple CSV files for something like a "database" of transactions grouped by month

src/content vs src/data is one I could definitely go either way on! Both have pros and cons, a couple ideas I thought about while listening to the community call today:

EddyVinck commented 1 year ago

I am working on a template for a talk I'll be giving at a soon-to-be-announced conference, where I want to show people how they can set up an engineering blog for themselves or a multi-author blog for their company.

A feature like relational data from JSON would be really neat to have! In the meantime I manually hooked it all up based on a string ID.

A few notes on the earlier conversation here:

Just an idea I'd like to throw in here: what if you could do getCollection('blog') for the current behavior, and getCollection({ name: 'author', type: 'json' }) for the data collection? I haven't looked at the internals of getCollection much but from my perspective as mostly an Astro user it makes sense, and it would open up possibilities for other type options without adding even more getThingCollection() functions and having to import those.
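One way to read this suggestion, as a rough sketch in plain TypeScript: accept either a collection name or an options object, and normalize both to one internal shape. The `CollectionQuery` type, the `normalizeQuery` helper, and the `"content"` default are assumptions for illustration, not part of Astro's API.

```typescript
// Either a bare collection name (current behavior) or an options object.
type CollectionQuery = string | { name: string; type: string };

// Normalize both call styles to a single shape for internal lookup.
function normalizeQuery(query: CollectionQuery): { name: string; type: string } {
  if (typeof query === "string") {
    // Assumed default: a bare name keeps today's content collection behavior.
    return { name: query, type: "content" };
  }
  return { name: query.name, type: query.type };
}

// Usage: both call styles from the suggestion above.
const blogQuery = normalizeQuery("blog");
const authorQuery = normalizeQuery({ name: "author", type: "json" });
```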

bholmesdev commented 1 year ago

Ah, that's great to hear @EddyVinck! I'm working to have a preview release by end-of-day tomorrow. I'll share that branch here once it's up.

Just an idea I'd like to throw in here: what if you could do getCollection('blog') for the current behavior, and getCollection({ name: 'author', type: 'json' }) for the data collection?

This is an interesting idea! Though I will admit, I'm not sure if type should be tied to file extensions. The goal is to separate based on shape of the response (note content and data have different return shapes), and each file extension of a given type adheres to that shape. In other words, file extension shouldn't matter when you're querying; just the shape of the response. So far I've considered 3 possible types:

I also worry that the type API reads like a type cast, implying you could import a content collection as a data collection. Though I also see a parallel to import assertions which could be nice. Either way, since I don't see us having more than the 3 shapes outlined above, I think getDataCollection() is a compromise to avoid breaking changes. We'll think over it though!

bholmesdev commented 1 year ago

Well I'm a man of my word! Here's a preview branch + video overview of the new data collection APIs. Still waiting on the preview release CI action (something's holding it up...) but you should be able to clone the repo and try the examples/with-data starter 🚀

https://github.com/withastro/astro/pull/6850

connor-baer commented 1 year ago

Looks like the link got cut off, here's the actual PR: https://github.com/withastro/astro/pull/6850

bholmesdev commented 1 year ago

Discussion on src/data/ vs. src/content/

We've considered two options for storing data collections: using the same src/content/ directory we have today, or introducing a new, separate src/data/ directory.

Why src/data/

Why src/content/

Conclusion

There are compelling pros on both sides. Today, we have a deploy preview using src/data/ to get user feedback before making a final decision. Though based on the API bash with our core team and feedback below, using src/content/ for everything could be the more intuitive API.

tony-sull commented 1 year ago

This one's really minor, but I also like that reusing src/content/ means Astro isn't claiming another special directory in src

That means one less major breaking change and less chance of getting in the user's way if someone wants their own src/data directory

jasikpark commented 1 year ago

Using src/content also allows for colocating your data with your content, where you could even put frontmatter you might end up putting in a markdown file in a separate json data file

bholmesdev commented 1 year ago

@jasikpark So as part of this, I don't expect data and content to be able to live in the same collection. We'd require users to specify the type of a collection in their config file (i.e. defineCollection({ type: 'data' ... })). This is arguably a point against supporting everything in src/content/ since it may encourage users to try this pattern when it is not supported.

jasikpark commented 1 year ago

ohhhhhh i forgot that

my-post/
    content.mdoc
    post-image.jpeg
    post-image2.webp
    data.json

wouldn't be supported...

hmm i dunno how i feel about either directory then 🙈

bholmesdev commented 1 year ago

@jasikpark Well that would be supported still, as long as you add an underscore _ to the file name to mark as ignored in our type checker. We won't run mixed content and data through the same Zod schema. So colocation is fine, but mixed validation is not
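As a rough illustration of the underscore convention, here is a sketch in plain TypeScript. The `isIgnoredFile` helper is hypothetical and only checks the file name, as described above; how Astro actually matches ignored paths may differ.

```typescript
// Return true when the file name (last path segment) starts with "_".
function isIgnoredFile(relativePath: string): boolean {
  const segments = relativePath.split("/");
  const fileName = segments[segments.length - 1];
  return fileName.charAt(0) === "_";
}

// The folder from the earlier comment, with colocated files prefixed.
const files = [
  "my-post/content.mdoc",
  "my-post/_post-image.jpeg",
  "my-post/_data.json",
];

// Only non-ignored files would be validated against the collection schema.
const validatedEntries = files.filter((file) => !isIgnoredFile(file));
```

Here only `my-post/content.mdoc` remains, so the colocated JSON and images never reach the schema.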

jasikpark commented 1 year ago

Ok - I guess I've been thinking of a collection entry as a folder rather than a markdoc file 😓 good to understand that better

bholmesdev commented 1 year ago

@jasikpark That's a model keystatic has adopted actually! Since nested directories are used for slug / id generation, we haven't used this same model. It almost reminds me of the NextJS app/ directory vs. our current routing story.

jasikpark commented 1 year ago

cool, i'll play around w/ that then - thx for all the responses 💜

how does it make you think of the app folder for nextjs?

bholmesdev commented 1 year ago

@jasikpark Well, it's the difference between file vs. directory-based routing. I think content collections and Astro's pages/ router have a lot of parallels, including the underscore _ for colocation. The other solution is to use wrapper directories for everything, where key files have a special name (like page.tsx or content.mdoc) with colocation allowed for any other files. I've heard thoughts on supporting both, which is interesting!

jasikpark commented 1 year ago

ooohhhh thx for clarifying, that's interesting yeah

ematipico commented 1 year ago

I'm more in favour of src/content, mostly because reserving another directory feels too much. This doesn't mean that we can't have an src/content/data folder, although it might NOT make sense because they are two different concepts.

bholmesdev commented 1 year ago

Thanks for the input y'all! Implemented src/content/ on the latest PR. Stage 3 RFC to come

matthewp commented 1 year ago

Closing as this is completed and in stable.