Closed bholmesdev closed 1 year ago
We've discussed some initial examples of how data collections could work, including querying and referencing. This was informed by the work @tony-sull and I did on the astro.build site.
Say you have a collection of blog post authors you would like to store as JSON. You can create a new collection under src/data/
like so:
src/data/
authors/
ben.json
fred.json
matthew.json
This collection can be configured with a schema like any other content collection. To flag the collection as data-specific, we may expose a new defineDataCollection()
helper:
// src/content/config.ts
import { defineDataCollection, z } from 'astro:content';
const authors = defineDataCollection({
schema: z.object({
name: z.string(),
twitter: z.string().url(),
})
});
export const collections = { authors };
It can also be queried like any other collection. This example uses getDataCollection('authors'):
---
import { getDataCollection } from 'astro:content';
const authors = await getDataCollection('authors');
---
<ul>
{authors.map(author => (
<li>
<a href={author.data.twitter}>{author.data.name}</a>
</li>
))}
</ul>
Data collections will return a subset of fields exposed by content collections:
type DataCollectionEntry<C> = {
id: string;
data: object;
collection: C;
}
This omits a few key fields:
- render(), which produces the Content component. Data collections have no HTML to render, so the function is removed.
- slug, a URL-friendly form of the id, like a permalink. Since data collections are not meant to be used as pages, this is omitted.
- body. Returning it would give body a double meaning depending on the context: non-data information for content collections, and the "raw" data itself for data collections. We can omit the body for an initial release to avoid this confusion.
Data collections could be referenced from existing content collection schemas. One example may be a reference() function (see @tony-sull's early experiment) to reference data collection entries by slug from your frontmatter.
This example allows you to list all blog post authors from each blog post:
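// src/content/config.ts
// image() is assumed here to be the experimental assets helper exported from 'astro:content'
import { defineCollection, defineDataCollection, image, reference, z } from 'astro:content';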
const blog = defineCollection({
schema: z.object({
title: z.string(),
authors: z.array(reference("authors")),
})
});
const authors = defineDataCollection({
schema: z.object({
name: z.string(),
avatar: image(),
})
});
export const collections = { blog, authors };
Then, authors can be referenced by slug from each blog
entry's frontmatter. This should validate each slug and raise a readable error for invalid authors:
---
title: Astro 2.0 launched
authors:
- fred-schott
- ben-holmes
---
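As a rough illustration of the "validate each slug and raise a readable error" behavior described above, here is a minimal sketch in plain TypeScript. The validateReferences helper and its result shape are hypothetical names for this example only; they are not part of astro:content.

```typescript
// Hypothetical sketch: check frontmatter references against the ids of a data collection.
type ReferenceCheck =
  | { ok: true; ids: string[] }
  | { ok: false; message: string };

function validateReferences(
  collection: string,
  knownIds: ReadonlySet<string>,
  referenced: readonly string[],
): ReferenceCheck {
  // Collect every referenced id that has no matching entry.
  const missing = referenced.filter((id) => !knownIds.has(id));
  if (missing.length > 0) {
    const available = Array.from(knownIds).sort().join(", ");
    return {
      ok: false,
      message:
        `Unknown entries in "${collection}": ${missing.join(", ")}. ` +
        `Available entries: ${available}.`,
    };
  }
  return { ok: true, ids: [...referenced] };
}

// "ben-holms" is a deliberate typo to trigger the readable error.
const authorIds = new Set(["ben-holmes", "fred-schott", "tony-sull"]);
const result = validateReferences("authors", authorIds, ["fred-schott", "ben-holms"]);
console.log(result.ok ? "valid" : result.message);
```

Listing the available ids in the message is one way to make the error actionable; a real implementation might also point at the offending file.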
How do you differentiate between 1:1 and 1:N? Is relation always ["tony-sull"] and never "tony-sull"? It could be enough to implement 1:1 by doing relation("author").length(5), but then you're always stuck with an array on both the frontmatter and the query side of things.
We should try to support 1:1 relations if we can or, if we can't, make sure that's mentioned in the RFC.
This may even make sense to add as an explicit goal of the stage 2 proposal, since I'd argue the query DX hit of always needing to unwrap an array for a 1:1 relation isn't acceptable.
- src/content/config.ts seems off, assuming a typo. Doesn't match the format of https://docs.astro.build/en/guides/content-collections/#defining-a-collection-schema (maybe just missing defineCollection).
- rel or ref instead of relation. I'm usually not a fan of abbreviations, but in this case I think it's consistent with how minimal z is. Also curious if "reference" makes more sense, especially seeing that example where the default value is ["tony-sull"] (which is literally a reference to the tony-sull data object).
- relation("authors").default(["tony-sull"]): That default took a second for me to parse, but I think that makes sense (vs. not supporting it at all). How much of the Zod API does relation() implement? Hopefully all of it?
- User-facing APIs to introduce new data collection formats like YAML or TOML. We recognize the value of community plugins to introduce new formats, and we will experiment with a pluggable API internally. Still, a finalized user-facing API will be considered out-of-scope.
Another reason to avoid this for now: if the whole reference system works by referencing an id/slug, then having a single large CSV doesn't guarantee a column as id/slug. We'd need some additional config to define which column is the primary key.
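To make the primary key concern concrete, here is a small sketch of what that extra config might have to do. The indexByPrimaryKey helper and the "handle" column are assumptions for illustration; nothing like this exists in the proposal.

```typescript
// Hypothetical sketch: index already-parsed CSV rows by a user-configured primary key column.
type Row = Record<string, string>;

function indexByPrimaryKey(rows: Row[], primaryKey: string): Map<string, Row> {
  const entries = new Map<string, Row>();
  rows.forEach((row, index) => {
    const id = row[primaryKey];
    // A row without a key value cannot be referenced, so fail loudly.
    if (id === undefined || id === "") {
      throw new Error(`Row ${index + 1} has no value for primary key column "${primaryKey}".`);
    }
    // Duplicate keys would make references ambiguous.
    if (entries.has(id)) {
      throw new Error(`Duplicate id "${id}" in primary key column "${primaryKey}".`);
    }
    entries.set(id, row);
  });
  return entries;
}

const rows: Row[] = [
  { handle: "ben-holmes", name: "Ben" },
  { handle: "fred-schott", name: "Fred" },
];
const authorsById = indexByPrimaryKey(rows, "handle");
console.log(authorsById.get("fred-schott")?.name);
```

With one file per entry, the file name plays this role for free, which is part of why the per-file layout is simpler.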
@FredKSchott Thanks for the suggestions! Think I agree with all of these. Thoughts on the API design:
- ref or reference are better names. I lean towards reference to avoid colliding with state management concepts from Vue et al.
...
// Reference a single author by id (default)
author: reference('authors'),
// Reference multiple authors in a list of IDs
authors: reference('authors', { relation: 'many' }),
- src/data/ directory. This lets us play with ideas like removing the render() and slug conventions without much effort.
relation("authors").default(["tony-sull"]) That default took a second for me to parse, but I think that makes sense (vs. not supporting it at all). How much of the Zod API does relation() implement? Hopefully all of it?
Looking into this, I don't think we can support default() and optional() chaining in this way. Under the hood, reference() would transform the input ID to the actual data that ID points to:
export function reference(id) {
const dataModule = import(resolveIdToDataImport(id));
return z.object(dataModule);
}
This means helpers like .transform() and .refine() can work as expected, in case you want to massage data further (see the image().refine(...) example for our experimental assets). But since data is already resolved, default in particular wouldn't work (optional() might). I think each of these should be parameters instead, if supported:
...
authors: reference('authors', { default: 'ben-holmes' }),
I'd definitely prefer the z.array(reference()) syntax if there aren't too many tradeoffs.
For me I can't think of many uses for .default() or .transform() inside of z.array(), if that's the main tradeoff. I might want to default the array itself, but I'm not sure if .default() would ever run on an individual array item. For .transform(), would there be any performance improvement trying to transform each array item individually vs. transforming the full array once it resolves?
Thanks @tony-sull, I agree that's intuitive! I'm wrestling with whether reference()
should A) transform IDs to your data directly with a Zod transform, or B) avoid the transform and return some flag to tell Astro "hey! Post-process this Zod object key please!". These are the tradeoffs in what Zod extension functions we could support:
// Solution a)
author: reference('authors').transform(data => ...), // ✅ works
author: reference('authors').refine(data => ...), // ✅ works
author: reference('authors').default('ben-holmes') // ❌ Doesn't work. Data already resolved!
authors: z.array(reference('authors')) // ❌ Doesn't work. Data already resolved!
// Solution b)
author: reference('authors').transform(data => ...), // ❌ Doesn't work. Data not resolved yet!
author: reference('authors').refine(data => ...), // ❌ Doesn't work. Data not resolved yet!
author: reference('authors').default('ben-holmes') // ✅ works
authors: z.array(reference('authors')) // ✅ works
If we want to have Zod both ways, we need to add configuration options for the functions we don't support.
Ex. if we support .transform()
and .refine()
, we'd need the following for array
and default
:
authors: reference('authors', { array: true, default: 'tony-sull' }),
From what I've seen, I expect users to lean on array
and default
more than transform
and refine
. So I'm starting to agree that is the better way to go 👍
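For anyone weighing the options-object route, here is a rough sketch of what handling array and default as parameters could involve. The option names come from the example above; normalizeReferenceInput is a hypothetical helper, and accepting a single default id in array mode is an assumption based on that example.

```typescript
// Hypothetical sketch: normalize raw frontmatter input for a reference field
// configured with the `array` and `default` options discussed above.
type ReferenceOptions = { array?: boolean; default?: string };

function normalizeReferenceInput(
  collection: string,
  input: unknown,
  options: ReferenceOptions = {},
): string[] {
  // Apply the default while we still have raw ids, before any data lookup.
  const value = input === undefined ? options.default : input;
  if (value === undefined) {
    throw new Error(`Missing required reference to "${collection}".`);
  }
  if (!options.array && Array.isArray(value)) {
    throw new Error(`Expected a single id for "${collection}", received a list.`);
  }
  // Assumption: a single id is accepted in array mode and wrapped in a list.
  const ids: unknown[] = Array.isArray(value) ? value : [value];
  if (!ids.every((id) => typeof id === "string")) {
    throw new Error(`Every reference to "${collection}" must be a string id.`);
  }
  return ids as string[];
}

console.log(normalizeReferenceInput("authors", undefined, { array: true, default: "tony-sull" }));
console.log(normalizeReferenceInput("authors", "ben-holmes"));
```

The function always returns a list to keep the sketch short; a real API would presumably return a single value for 1:1 references.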
@tony-sull Just marked my comment above as outdated because... I'm wrong! Zod transforms run separately, so you can totally do z.array(...)
around a transformer and have it still work. Now I'm 100% on-board with your suggestion 👍
@bholmesdev Excellent! I wasn't actually sure if that setup would work, glad that does the trick! 🚀
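The point that an array wrapper can sit around a transforming schema can be sketched without Zod. The Schema interface, referenceTo, and arrayOf below are a tiny stand-in written for this example, not Zod's API; the idea is only that the outer schema parses each item with the inner one, so the inner transform runs per item.

```typescript
// A miniature stand-in for a schema library (NOT Zod itself).
interface Schema<Out> {
  parse(input: unknown): Out;
}

type Author = { name: string };

// Hypothetical in-memory store standing in for the authors data collection.
const authorStore: Record<string, Author> = {
  "ben-holmes": { name: "Ben" },
  "tony-sull": { name: "Tony" },
};

// Validates the id, then transforms it into the entry it points to.
function referenceTo(collection: string, store: Record<string, Author>): Schema<Author> {
  return {
    parse(input) {
      if (typeof input !== "string") {
        throw new Error(`Expected a "${collection}" id string.`);
      }
      const entry = store[input];
      if (!entry) {
        throw new Error(`No "${collection}" entry with id "${input}".`);
      }
      return entry;
    },
  };
}

// Parses each item with the inner schema, so the inner transform runs once per item.
function arrayOf<Out>(item: Schema<Out>): Schema<Out[]> {
  return {
    parse(input) {
      if (!Array.isArray(input)) {
        throw new Error("Expected an array.");
      }
      return input.map((value) => item.parse(value));
    },
  };
}

const authorsField = arrayOf(referenceTo("authors", authorStore));
console.log(authorsField.parse(["ben-holmes", "tony-sull"]).map((author) => author.name));
```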
There have been a few mentions of supporting a single file to store all data collection entries as an array, instead of splitting up entries per file as we do with content collections today. This would mean, say, a single src/content/authors.json file instead of a few src/content/authors/[name].json files.
Investigating this, I think it's best to stick with multiple files instead. Reasons:
- en.json | fr.json | jp.json ... for this use case.
The reasons in favor of [collection].json:
- src/content/ folder. You can see which collections have data vs. which have content at a glance without opening the config file, which is convenient. The alternative for file-based identification when using file entries would be a src/data/ folder.
I'd definitely lean towards multiple files for data collections, at least for the use cases I can think of. CSVs are an interesting one, but I could even see wanting multiple CSV files for something like a "database" of transactions grouped by month.
src/content vs src/data is one I could definitely go either way on! Both have pros and cons, a couple ideas I thought about while listening to the community call today:
- The main difference is a content collection entry gets a .render() function to generate HTML. Two separate directories may help make that distinction more clear.
- On the other hand, using one src/content folder would make an upgrade path easier if you need to go from src/data/authors/ben.json to src/content/authors/ben.md to add something like content for an About page.
I am working on a template for a talk I'll be giving at a soon-to-be-announced conference, where I want to show people how they can set up an engineering blog for themselves or a multi-author blog for their company.
A feature like relational data from JSON would be really neat to have! In the meantime I manually hooked it all up based on a string ID.
A few notes on the earlier conversation here:
- reference() makes sense to me
- src/content and src/data makes sense to me
Just an idea I'd like to throw in here: what if you could do getCollection('blog') for the current behavior, and getCollection({ name: 'author', type: 'json' }) for the data collection? I haven't looked at the internals of getCollection much, but from my perspective as mostly an Astro user it makes sense, and it would open up possibilities for other type options without adding even more getThingCollection() functions and having to import those.
Ah, that's great to hear @EddyVinck! I'm working to have a preview release by end-of-day tomorrow. I'll share that branch here once it's up.
Just an idea I'd like to throw in here: what if you could do getCollection('blog') for the current behavior, and getCollection({ name: 'author', type: 'json' }) for the data collection?
This is an interesting idea! Though I will admit, I'm not sure if type should be tied to file extensions. The goal is to separate based on shape of the response (note content and data have different return shapes), and each file extension of a given type adheres to that shape. In other words, file extension shouldn't matter when you're querying; just the shape of the response. So far I've considered 3 possible types:
- content - Markdown, MDX, and Markdoc as we have today. These feature both data and a render-able body.
- data - JSON, YAML, CSVs, and other data types. These feature data alone, without a render-able body.
- page - Content intended to be used as pages. These feature data and a render-able body, along with properties for mapping URLs, like a permalink.
I also worry that the type API reads like a type cast, implying you could import a content collection as a data collection. Though I also see a parallel to import assertions which could be nice. Either way, since I don't see us having more than the 3 shapes outlined above, I think getDataCollection() is a compromise to avoid breaking changes. We'll think over it though!
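To illustrate "separate based on shape of the response", the three shapes could be modeled as a discriminated union. The type field, the permalink property name, and the helper names below are assumptions for this sketch, not a proposed API.

```typescript
// Hypothetical shapes for the three collection types discussed above.
type BaseEntry = { id: string; collection: string; data: Record<string, unknown> };
type DataEntry = BaseEntry & { type: "data" };
type ContentEntry = BaseEntry & { type: "content"; body: string };
type PageEntry = BaseEntry & { type: "page"; body: string; permalink: string };
type AnyEntry = DataEntry | ContentEntry | PageEntry;

// Only content and page entries carry a render-able body.
function isRenderable(entry: AnyEntry): entry is ContentEntry | PageEntry {
  return entry.type !== "data";
}

function describeEntry(entry: AnyEntry): string {
  return isRenderable(entry)
    ? `${entry.collection}/${entry.id} has a body of ${entry.body.length} characters`
    : `${entry.collection}/${entry.id} is data only`;
}

const author: AnyEntry = { type: "data", id: "ben", collection: "authors", data: { name: "Ben" } };
const post: AnyEntry = { type: "content", id: "launch", collection: "blog", data: {}, body: "Hello" };
console.log(describeEntry(author));
console.log(describeEntry(post));
```

Note that the file extension never appears in these types, which matches the argument that querying should depend on shape alone.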
Well I'm a man of my word! Here's a preview branch + video overview of the new data collection APIs. Still waiting on the preview release CI action (something's holding it up...) but you should be able to clone the repo and try the examples/with-data
starter 🚀
Looks like the link got cut off, here's the actual PR: https://github.com/withastro/astro/pull/6850
src/data/ vs. src/content/
We've considered two options for storing data collections: using the same src/content/ directory we have today, or introducing a new, separate src/data/ directory.
src/data/
- _data/ convention).
- body and a render() utility, while data does not). This makes data-specific APIs like getDataEntryById() easier to conceptualize.
- src/content/, but using the directory to define the collection type creates a pit of success.
src/content/
- content/ directory
- src/content/, so I'll add my new "data" collection in this directory too. From user testing with the core team, this expectation arose a few times.
- authors data collection, so you move json -> md while retaining your schema.
- src/data/, requiring the config to live in a src/content/config.ts is confusing. This is amplified when you do not have any content collections.
There are compelling pros on both sides. Today, we have a deploy preview using src/data/ to get user feedback before making a final decision. Though based on the API bash with our core team and feedback below, using src/content/ for everything could be the more intuitive API.
This one's really minor, but I also like that reusing src/content/ means Astro isn't claiming another special directory in src. That means one less major breaking change and less chance of getting in the user's way if someone wants their own src/data directory.
Using src/content also allows for colocating your data with your content. You could even move frontmatter that would otherwise live in a markdown file into a separate json data file.
@jasikpark So as part of this, I don't expect data and content to be able to live in the same collection. We'd require users to specify the type of a collection in their config file (i.e. defineCollection({ type: 'data' ... })
). This is arguably a point against supporting everything in src/content/
since it may encourage users to try this pattern when it is not supported.
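A quick sketch of the kind of check this implies: given a collection's declared type, flag files that don't belong. The extension lists come from this thread (.md, .mdx, .mdoc for content and JSON for data); the findMismatchedFiles helper is hypothetical.

```typescript
// Hypothetical sketch: report files whose extension doesn't match the collection's declared type.
type CollectionType = "content" | "data";

const EXTENSIONS: Record<CollectionType, readonly string[]> = {
  content: [".md", ".mdx", ".mdoc"],
  data: [".json"],
};

function findMismatchedFiles(type: CollectionType, files: readonly string[]): string[] {
  const allowed = EXTENSIONS[type];
  // Keep only files that match none of the allowed extensions.
  return files.filter((file) => !allowed.some((ext) => file.endsWith(ext)));
}

console.log(findMismatchedFiles("data", ["ben.json", "fred.md"]));
```

Surfacing the mismatch as a readable error would at least make the "not supported" case obvious to users who try mixing formats.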
ohhhhhh i forgot that
my-post/
content.mdoc
post-image.jpeg
post-image2.webp
data.json
wouldn't be supported...
hmm i dunno how i feel about either directory then 🙈
@jasikpark Well that would be supported still, as long as you add an underscore _
to the file name to mark as ignored in our type checker. We won't run mixed content and data through the same Zod schema. So colocation is fine, but mixed validation is not
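Here's a small sketch of that underscore convention applied to the folder layout above. It only looks at the file name, as described in this comment; isIgnoredFile and entriesToValidate are hypothetical helper names.

```typescript
// Hypothetical sketch: skip colocated files whose file name starts with an underscore.
function isIgnoredFile(relativePath: string): boolean {
  const fileName = relativePath.split("/").pop() ?? "";
  return fileName.startsWith("_");
}

// Only non-ignored files would be run through the collection's schema.
function entriesToValidate(paths: readonly string[]): string[] {
  return paths.filter((path) => !isIgnoredFile(path));
}

const files = [
  "my-post/content.mdoc",
  "my-post/_post-image.jpeg",
  "my-post/_data.json",
];
console.log(entriesToValidate(files));
```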
Ok - I guess I've been thinking of a collection entry as a folder rather than a markdoc file 😓 good to understand that better
@jasikpark That's a model keystatic has adopted actually! Since nested directories are used for slug
/ id
generation, we haven't used this same model. It almost reminds me of the NextJS app/
directory vs. our current routing story.
cool, i'll play around w/ that then - thx for all the responses 💜
how does it make you think of the app folder for nextjs?
@jasikpark Well, it's the difference between file vs. directory-based routing. I think content collections and Astro's pages/
router have a lot of parallels, including the underscore _
for colocation. The other solution is to use wrapper directories for everything, where key files have a special name (like page.tsx
or content.mdoc
) with colocation allowed for any other files. I've heard thoughts on supporting both, which is interesting!
ooohhhh thx for clarifying, that's interesting yeah
I'm more in favour of src/content
, mostly because reserving another directory feels too much. This doesn't mean that we can't have an src/content/data
folder, although it might NOT make sense because they are two different concepts.
Thanks for the input y'all! Implemented src/content/
on the latest PR. Stage 3 RFC to come
Closing as this is completed and in stable.
Summary
Introduce a standard to store data separately from your content (ex. JSON files), with a way to "reference" this data from existing content collections.
Background & Motivation
Content collections are restricted to supporting .md, .mdx, and .mdoc files. This is limiting for other forms of data you may need to store, namely raw data formats like JSON.
Taking a blog post as the example, there will likely be author information that's reused across multiple blog posts. To standardize updates when, say, updating an author's profile picture, it's best to store authors in a separate data entry, with an API to reference this data from any blog post by ID.
The content collections API was built generically to support this future, choosing format-agnostic naming like data instead of frontmatter and body instead of rawContent. Because of this, expanding support to new data formats without API changes is a natural progression.
Use cases
We have a few use cases in mind considering data collections and data references. We expect this list to grow through the RFC discussion and learning from our community!
- i18n/ collection containing en.json, fr.json, etc.
- alt text or image widths and heights for standard assets. For example, an images/banner.json file containing the src as a string, alt text, and a preferred width
Goals
- src/data/ directory distinct from src/content/, or simply allow data collections within src/content/.
Non-goals