Dealing with page ID stability: being robust when renaming page titles

feliksik commented 9 months ago

As the source text files do not have a Confluence ID, identity is matched on the page title, which is unique within a Confluence space.

But with the title as ID, renaming a page leads to the page being deleted and a new page being created. As a consequence, incoming links break 🙁 (i.e. links from other Confluence pages/spaces not managed by the same text2confl project). It would be much better to update the existing page instead.

We could use the filename (optionally combined with a self-made-up-identifier metadata field, which can be even more stable but has to be managed manually in the text file) and store it as metadata on the Confluence page. When uploading, we would read that metadata and match it against the file, giving us a bridge between the Confluence page ID and the AsciiDoc/Markdown source file. This would solve the delete-recreate issue.

I suppose there are some details to work out, and alternative approaches to consider. I think this would be a very valuable feature for usage in a larger context, where the stability of incoming links is rather important. I'm happy to collaborate on making this work.

(Note: Initially I also raised the page stealing issue from #142 here, but I now think it deserves a separate solution.)

feliksik commented 9 months ago

I have also thought about having a metadata confluence-page-id field that can be managed in the text file, instead of a self-made-up-identifier that needs to be added in the text file, and administered as metadata in Confluence.

Obviously the ID would only be known after the page is created in Confluence, so it would have to be added to the text later: either manually, or by the tooling, which would insert it as the first line of the text file.

However, this does not seem like a good idea:

- If the tooling added the ID: deployment probably runs in a CI/CD pipeline, which is not a great moment to create new git commits.
- I can imagine a workflow where you deploy the docs to MyProductionSpace in CI/CD, but I develop the documentation before PR/merge in MyTestSpace. A self-made-up-identifier could be unique per space, yet be reused in both the Test and the Production space. By contrast, confluence-page-id would break this workflow of having 2 deployments of the same document (unless we make it more advanced/complicated).

feliksik commented 9 months ago

New idea: provide an option --follow-git-renames. It would use something like git log --follow --diff-filter=A -- possibly-renamed.md to determine the commit in which a file was introduced, and use ${commitHash}-${originalFilename} as the identifier of the document in the Confluence page metadata.

I think this will achieve exactly what I intend:

- renaming the file will keep the same page id
- changing the title will keep the same page id
- no manual effort needed to maintain a self-made-up-id in the document metadata
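
To make the idea a bit more concrete, here is a minimal sketch of how the identifier could be derived. It is only an illustration under assumptions: the helper names (`stableId`, `parseIntroduction`, `gitIntroduction`) are made up, I added `--format=%H --name-only` to the git command so that both the hash and the original path are printed, and I assume that if a file was added more than once, the oldest entry is the one we want. The sample hash and path in `main` are invented.

```kotlin
import java.io.File

// Hypothetical helper: builds the "${commitHash}-${originalFilename}" identifier.
fun stableId(commitHash: String, originalPath: String): String =
    "$commitHash-$originalPath"

// Parses the output of
//   git log --follow --diff-filter=A --format=%H --name-only -- <file>
// which is assumed to consist of a hash line followed by a path line per commit.
// If several "add" commits are listed, the last (oldest) pair is used.
fun parseIntroduction(gitOutput: String): Pair<String, String>? {
    val lines = gitOutput.lines().map { it.trim() }.filter { it.isNotEmpty() }
    if (lines.size < 2) return null
    return lines[lines.size - 2] to lines[lines.size - 1]
}

// Runs git in repoDir and returns (commitHash, originalPath), or null on failure.
fun gitIntroduction(repoDir: File, path: String): Pair<String, String>? {
    val process = ProcessBuilder(
        "git", "log", "--follow", "--diff-filter=A",
        "--format=%H", "--name-only", "--", path
    )
        .directory(repoDir)
        .redirectError(ProcessBuilder.Redirect.DISCARD)
        .start()
    val output = process.inputStream.bufferedReader().readText()
    return if (process.waitFor() == 0) parseIntroduction(output) else null
}

fun main() {
    // Invented sample output, in the shape the parser expects.
    val sample = "3f2a9c1\n\ndocs/old-name.md\n"
    val (hash, original) = parseIntroduction(sample)!!
    println(stableId(hash, original)) // prints 3f2a9c1-docs/old-name.md
}
```

Keeping the parsing separate from the git call means the ID logic can be tested without a repository.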

@zeldigas I'm not sure how much time you spend on this project, but I may get to implementing this myself when time permits. Either way, it's useful to first align on this idea.

zeldigas commented 7 months ago

@feliksik I believe that a file rename is not an issue at all for page renames when you have explicit page titles: any cleanup is done after the whole doc tree is processed, so even if a file was renamed or moved to another location, it will be processed properly.

But for title renames it's challenging. You mentioned some metadata that can be associated with the page. While this can be set for sure, the main challenge would be to find the page by it. I did not dig deep into it, but I doubt that this is available out of the box, if it is possible at all. I found some docs, but they apply only to the server version and require server-side configuration: https://developer.atlassian.com/server/confluence/content-properties-in-the-rest-api/. And iterating over all the pages might not be a good idea at all.

That said, this idea needs some research and I really appreciate your help here, as I'm not sure that I'll be able to research this on my own within a reasonable time.

It's probably worth starting with research into whether it's possible to search for a page by some metadata.

Another thought that I have: with additional constraints it might be possible to do this without such a search, but at the cost of additional load on Confluence. As we know the parent page, we can try to fetch information about all child pages (recursively) and use this page tree to search for renamed pages, either based on the file name or based on the hash that you mentioned.
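
A rough sketch of that tree walk, just to illustrate the shape of it. Everything here is an assumption for illustration: `ServerPage` is a made-up data class rather than the project's model, and `fetchChildren` stands in for whatever client call returns the direct children of a page.

```kotlin
// Hypothetical shape of a page as returned by the server; not the project's actual model.
data class ServerPage(val id: String, val title: String, val stableId: String?)

// Collects all descendants of rootId breadth-first.
// fetchChildren is a stand-in for a client call returning the direct children of a page.
fun collectDescendants(
    rootId: String,
    fetchChildren: (String) -> List<ServerPage>
): List<ServerPage> {
    val found = mutableListOf<ServerPage>()
    val visited = mutableSetOf(rootId)
    val queue = ArrayDeque(listOf(rootId))
    while (queue.isNotEmpty()) {
        val parentId = queue.removeFirst()
        for (child in fetchChildren(parentId)) {
            // Guard against visiting the same page twice.
            if (visited.add(child.id)) {
                found += child
                queue.addLast(child.id)
            }
        }
    }
    return found
}

fun main() {
    // In-memory stand-in for the server: parent id -> direct children.
    val tree = mapOf(
        "root" to listOf(ServerPage("1", "Guide", "abc-guide.md")),
        "1" to listOf(ServerPage("2", "Setup", null))
    )
    val pages = collectDescendants("root") { tree[it].orEmpty() }
    println(pages.map { it.title }) // prints [Guide, Setup]
}
```

Each page costs one call to `fetchChildren`, so the extra load grows with the size of the subtree under the parent page.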

feliksik commented 7 months ago

You are spot on in your analysis. This is not my highest priority, but I'll keep you posted when I make any progress.

feliksik commented 4 months ago

I have taken a further look at the code base; I'm reporting here for my own recollection and yours.

I think the following would be possible:

- We modify the page conversion so that it creates a val pagesToPublish : List<Page> in which every Page has a page header attribute with the GIT_BASED_STABLE_PAGE_ID (e.g. as described here).
- We make a list parentList of all the page IDs that may contain renamed pages as children, and for which we already have an ID:
  - the default-parent-id (unmanaged document)
  - the parent IDs mentioned as page attributes (may be unmanaged documents)
- For each page in parentList, we recursively query the child pages using ConfluenceClient#findChildPages(), similar to here, but also fetch the GIT_BASED_STABLE_PAGE_ID property. These are all collected as serverPages : List<ServerPage>. Note that, if for efficiency we don't want to query a huge space, it's OK to keep recursing only into the children of pages that are either in parentList or owned by our tenant. This is similar to the current behavior: when I currently move a page outside of the default-parent-id tree and also rename it, it is not removed as an orphan either.
- The serverPages now contain the page IDs for all the relevant pages in pagesToPublish.
- For each page in pagesToPublish, find its ID in serverPages by first comparing the GIT_BASED_STABLE_PAGE_ID (if available in the serverPage) and otherwise the page title. This gives the page IDs for most pages, but not for those that are to be newly created, nor for pages that were somehow moved outside of the default-parent-id tree on the server (as those are not found in serverPages, for the efficiency reasons mentioned above).
- For each page in pagesToPublish, do createOrUpdatePageContent, but with a slightly modified findPageOnServer: it should fetch the full details/content using the pageId if available, and otherwise use the title. (As I understand it, just like in the current behavior, a page with the given title may have moved outside of the regular parent-id tree, in which case it must be an UPDATE; if no such page exists, it's a CREATE.) The update operation will initialize (but never change) the GIT_BASED_STABLE_PAGE_ID for the page if it wasn't already set. I see the update already does adjustTitleIfRequired (even though I'm not sure how this could happen, given my understanding of the current implementation of findPageOnServer).
- The orphan removal no longer needs to query the page children, as serverPages already contains those pages and the orphans can be found there. This means I don't think there will be many more queries than in the current implementation: I think there will only be extra calls to findChildPages() for the pages in parentList.
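
The matching and orphan steps above could look roughly like the sketch below. It is a simplification under assumptions, not the project's code: `LocalPage`, `ServerPage`, `resolvePageIds` and `findOrphans` are made-up names, the title fallback only considers server pages that have no stable ID yet, and ownership checks before deleting orphans are left out.

```kotlin
// Hypothetical models for illustration only; not the project's actual classes.
data class LocalPage(val title: String, val stableId: String)
data class ServerPage(val id: String, val title: String, val stableId: String?)

// Resolves the server page ID for each local page:
// the stable ID wins, and the title is a fallback for server pages without a stable ID.
fun resolvePageIds(local: List<LocalPage>, server: List<ServerPage>): Map<LocalPage, String?> {
    val byStableId = server.mapNotNull { s -> s.stableId?.let { it to s } }.toMap()
    val byTitle = server.filter { it.stableId == null }.associateBy { it.title }
    return local.associateWith { page ->
        (byStableId[page.stableId] ?: byTitle[page.title])?.id
    }
}

// Server pages that no local page resolved to are orphan candidates.
// A real implementation would also check that the page is owned by this tenant.
fun findOrphans(server: List<ServerPage>, resolved: Map<LocalPage, String?>): List<ServerPage> {
    val claimed = resolved.values.filterNotNull().toSet()
    return server.filter { it.id !in claimed }
}

fun main() {
    val server = listOf(
        ServerPage("101", "Old title", "abc-guide.md"),
        ServerPage("102", "Setup", null),
        ServerPage("103", "Removed page", "def-removed.md")
    )
    val local = listOf(
        LocalPage("New title", "abc-guide.md"), // renamed: matched by stable ID
        LocalPage("Setup", "ghi-setup.md"),     // no property on server yet: matched by title
        LocalPage("Brand new", "jkl-new.md")    // not on server: to be created
    )
    val resolved = resolvePageIds(local, server)
    println(resolved.values.toList())                    // prints [101, 102, null]
    println(findOrphans(server, resolved).map { it.id }) // prints [103]
}
```

A `null` result corresponds to the case where findPageOnServer would still have to fall back to a title lookup before deciding between CREATE and UPDATE.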

I think this should work, but it would require some refactoring. Whether this is worth it depends on how valuable you find this feature, and whether you're ok with having the logic adapted accordingly.

But I feel it would be a great feature, especially since Confluence Cloud uses page IDs so prominently in URLs: a text2confl page rename breaks the URL, while regular WYSIWYG users don't have this problem.

What do you think?

zeldigas / text2confl

Dealing with page ID stability: being robust when renaming page titles #144