withastro / astro

The web framework for content-driven websites. ⭐️ Star to support our work!
https://astro.build

Sitemap filter does not actually filter pages #7256

Closed AkashRajpurohit closed 1 year ago

AkashRajpurohit commented 1 year ago

What version of astro are you using?

2.5.6

Are you using an SSR adapter? If so, which one?

Cloudflare

What package manager are you using?

pnpm

What operating system are you using?

Mac

What browser are you using?

Firefox

Describe the Bug

The sitemap filter option does not actually filter pages based on the callback provided. The shared repro is the simple Astro blog example with a filter that should exclude /api/ routes; however, the generated sitemap still contains the route /api/test.
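For readers who cannot open the repro, here is a minimal sketch of the kind of filter callback described above. The function name `excludeApiRoutes` and the page list are hypothetical stand-ins, not copied from the repro; the only assumption about the integration is that `filter` receives each page URL as a string and keeps the page when the callback returns `true`.

```javascript
// Hypothetical filter callback: keep every page except those under /api/.
// `page` is the full page URL string handed to the callback.
function excludeApiRoutes(page) {
  return !new URL(page).pathname.startsWith('/api/');
}

// Hypothetical page list standing in for the routes of the blog example.
const pages = [
  'https://example.com/',
  'https://example.com/blog/first-post/',
  'https://example.com/api/test',
];

// The result the reporter expected to see reflected in the sitemap.
const kept = pages.filter(excludeApiRoutes);
console.log(kept);
```

In astro.config.mjs this would be passed as `sitemap({ filter: excludeApiRoutes })`. The bug reported here is that the generated sitemap still listed /api/test despite such a callback.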

Link to Minimal Reproducible Example

https://stackblitz.com/edit/astro-sitemap-repro

SerekKiri commented 1 year ago

Confirmed, I can reproduce the issue. If you need to solve it right away, one workaround might be to use serialize instead:

integrations: [
  mdx(),
  sitemap({
    serialize(item) {
      // Returning undefined drops the entry from the sitemap
      if (item.url.includes('/api/')) {
        return undefined;
      }
      return item;
    },
  }),
],

andremralves commented 1 year ago

Probably the problem is in this line:

https://github.com/withastro/astro/blob/c86f0c6e3e10efa13bec43969a275d82fb10da15/packages/integrations/sitemap/src/index.ts#L130

I will work on a solution.

xirkus commented 1 month ago

This still doesn't work for me regardless of the number of filter arguments provided (1..N). When I tested with the filter matching the complete site URL, it still generated a full list of pages.

Astro Version: 4.14.4
Site Map Version: 3.1.6

Example:

site: 'https://site.url',
integrations: [
  tailwind(),
  sitemap({
    filter: (page) => page !== 'https://site.url/',
  }),
  mdx(),
],

This will generate a full set of <url>s for the site.

rnwolf commented 4 weeks ago

Hi @xirkus

The following works for me:

  integrations: [
    react(),
    sitemap({
      filter: (page) =>
        page !== 'https://www.example.com/contact_problem' &&
        page !== 'https://www.example.com/test-a' &&
        page !== 'https://www.example.com/test-b' &&
        page !== 'https://www.example.com/elements' &&
        page !== 'https://www.example.com/contact_success',
    }),
    tailwind({

with "astro": "^4.11.3", and "@astrojs/sitemap": "^3.1.6",
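As the list of excluded pages grows, the chained comparisons in the filter above could be replaced by a lookup. This is only a sketch: `excludedPages` and `isIncluded` are names made up for illustration, and the URLs are the ones from the snippet above.

```javascript
// Hypothetical helper: the pages to leave out of the sitemap, as exact URLs.
const excludedPages = new Set([
  'https://www.example.com/contact_problem',
  'https://www.example.com/test-a',
  'https://www.example.com/test-b',
  'https://www.example.com/elements',
  'https://www.example.com/contact_success',
]);

// Same behaviour as the chained `!==` checks: keep a page unless it is listed.
function isIncluded(page) {
  return !excludedPages.has(page);
}

console.log(isIncluded('https://www.example.com/test-a'));
console.log(isIncluded('https://www.example.com/about'));
```

It would be used as `sitemap({ filter: isIncluded })`. Like the original comparisons, this is an exact string match, so a URL that differs only by a trailing slash is not excluded.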

rnwolf commented 4 weeks ago

OK, based on the Astro docs I have made the following changes to astro.config.mjs:

  integrations: [
    react(),
    sitemap({
      serialize(item) {
        // Exclude matching pages from the sitemap; extend this pattern to exclude more
        if (/contact_.*[a-z]|test-[a-z]|elements/.test(item.url)) {
          return undefined;
        }
        // Give blog posts with today's date in the URL, and the blog index page, a lastmod date
        const dateString = `${new Date().toLocaleString("en-CA", { timeZone: "Europe/London" }).slice(0, 10)}.*|blog`;
        if (new RegExp(dateString, 'i').test(item.url)) {
          item.changefreq = 'daily';
          item.lastmod = new Date();
          item.priority = 0.9;
        }
        return item;
      },
    }),

The idea is that when I commit a blog post, I prefix the filename with today's date in the format yyyy-mm-dd_some_slug.mdx. The sitemap then has a lastmod value for that blog post. Hopefully Google will then index the page sooner than it otherwise would.
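The date matching can be tried outside the Astro config. The sketch below pulls it into two standalone functions; `todayIn` and `needsLastmod` are hypothetical names, and `todayIn` builds the date with `Intl.DateTimeFormat.prototype.formatToParts` rather than slicing a locale string, which is a substitution made here for testability and not what the config above does.

```javascript
// Hypothetical helper: today's date as yyyy-mm-dd in the given time zone.
function todayIn(timeZone, now = new Date()) {
  const parts = new Intl.DateTimeFormat('en-GB', {
    timeZone,
    year: 'numeric',
    month: '2-digit',
    day: '2-digit',
  }).formatToParts(now);
  const get = (type) => parts.find((p) => p.type === type).value;
  return `${get('year')}-${get('month')}-${get('day')}`;
}

// Same pattern as in the serialize hook: today's date followed by anything,
// or the word "blog" anywhere in the URL (case-insensitive).
function needsLastmod(url, today) {
  return new RegExp(`${today}.*|blog`, 'i').test(url);
}

// 23:30 UTC on 1 September is already 2 September in London (UTC+1 in summer).
const today = todayIn('Europe/London', new Date('2024-09-01T23:30:00Z'));
console.log(today);
console.log(needsLastmod('https://www.example.com/notes/2024-09-02_some_slug/', today));
console.log(needsLastmod('https://www.example.com/about/', today));
```

Because the `blog` alternative matches anywhere in the URL, every page whose path contains "blog" gets the lastmod value, not only the blog index page. If posts live under /blog/, that includes all of them regardless of date.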