Sitemap Scope: The Silent Index Killer

April 16, 2026 · 6 min read

Google discovered 273 of 12,936 URLs from my sitemap. Search Console said 'Success.' The problem wasn't content quality or backlinks — it was a directory path.

273 out of 12,936

One month after submitting my sitemap to Google Search Console, the coverage report said this:

Sitemap: https://sixlines.online/sitemap.xml
Status: Success
Discovered pages: 273

Success. Green checkmark. 273 pages discovered.

The site has 12,936 URLs — 64 hexagram pages, 4,096 transition pages, 65 blog articles, plus glossary and comparison pages, all multiplied across three locales. The sitemap index pointed to 65 sub-sitemaps. Google read the index, said "Success," and ignored 97.9% of the URLs.

I spent a month assuming the problem was domain authority. New site, few backlinks, Google being cautious. The usual advice: build links, submit URLs manually, wait. Standard "Discovered — currently not indexed" troubleshooting.

The problem was a directory path.

The scope rule nobody mentions

The Sitemap Protocol has a scope rule. It's one sentence in Google's documentation, buried in the "Build a sitemap" page — not in the "Large sitemaps" page where you'd actually look when splitting sitemaps:

A Sitemap file located at http://example.com/catalog/sitemap.xml can include any URLs starting with http://example.com/catalog/ but cannot include URLs starting with http://example.com/images/.

A sitemap can only reference URLs in the same directory or below. This is a hard protocol constraint. Google enforces it silently — no error, no warning, just quiet rejection.
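The rule reduces to a prefix check against the sitemap's parent directory. A minimal sketch (the function name and the helper are mine, not from any library or from my actual codebase):

```typescript
// Returns true if `url` may legally appear in a sitemap served at
// `sitemapUrl`, per the Sitemap Protocol's location rule: a URL is in
// scope only if it sits in the sitemap's parent directory or below.
function isInSitemapScope(sitemapUrl: string, url: string): boolean {
  // Parent directory of the sitemap file, e.g.
  // https://example.com/sitemaps/0 -> https://example.com/sitemaps/
  const scope = sitemapUrl.slice(0, sitemapUrl.lastIndexOf("/") + 1);
  return url.startsWith(scope);
}

isInSitemapScope(
  "https://sixlines.online/sitemaps/0",
  "https://sixlines.online/en/hexagrams/1",
); // false — out of scope, silently dropped

isInSitemapScope(
  "https://sixlines.online/sitemap-0.xml",
  "https://sixlines.online/en/hexagrams/1",
); // true — root-level sitemap, whole site in scope
```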

My sub-sitemaps were served from:

/sitemaps/0
/sitemaps/1
...
/sitemaps/64

Their parent directory is /sitemaps/. Every URL they contained looked like:

https://sixlines.online/en/hexagrams/1
https://sixlines.online/zh-CN/blog/taichu-calendar-reform
https://sixlines.online/zh-TW/glossary

All under /en/, /zh-CN/, /zh-TW/. All outside the /sitemaps/ scope. Google correctly — and silently — ignored them.

How I got here

This wasn't carelessness. It was three systems conspiring against me.

Next.js pushed me toward nested paths. The framework's built-in generateSitemaps() is broken with [locale] dynamic routes in Next.js 16 — the auto-generated sitemap index gets caught by the locale segment and crashes. I'd already documented this in ADR-010 and built a workaround: custom route handlers at /sitemaps/[id]. A reasonable path. A natural directory structure. And a protocol violation.

Google Search Console gave false confidence. The sitemap index at /sitemap.xml — which lives at the root and was submitted via GSC — showed "Success." Submitting a sitemap via GSC relaxes the scope rule for that file. But not for the sub-sitemaps it references. The index was fine. The sub-sitemaps it pointed to were silently scoped to a directory that contained none of my actual pages.

Every SEO resource I consulted diagnosed the wrong problem. Searching "sitemap not being indexed" returns pages about content quality, crawl budget, E-E-A-T signals, backlink profiles, "Discovered — currently not indexed" fixes. All assuming Google successfully read the sitemap. Nobody asks "is your sitemap at a path that's allowed to reference those URLs?" because if you use a flat /sitemap.xml at root, you never encounter the scope rule. It only surfaces when you split into sub-sitemaps at a subdirectory path.

Why 273?

The core sitemap (/sitemaps/0) contains 648 URLs — static pages, hexagram detail pages, blog articles, glossary entries. All out of scope. But Google discovered 273 of them anyway.

My best explanation: because the parent sitemap index was submitted directly via GSC (which relaxes scope for that file), Google partially respected the sub-sitemap URLs through transitive leniency. It read sitemap 0 and indexed roughly the English-locale subset (648 URLs ÷ 3 locales ≈ 216, plus some cross-locale pages). The 64 transition sitemaps (12,288 URLs) were completely ignored.

The "Success" status was technically accurate — the sitemap index XML was valid. Google just didn't do anything useful with most of what it referenced.

The fix

Move sub-sitemaps to root-level paths. Instead of /sitemaps/0, serve them at /sitemap-0.xml. Parent directory: /. Scope: the entire site.

In Next.js, you can't create a dynamic route with a folder name like sitemap-[id].xml — the framework doesn't recognize [id] as a dynamic segment when it's embedded in a name with other characters. The build fails because params resolves to {} instead of { id: string }.

The workaround: a rewrite in next.config.ts.

// next.config.ts
import type { NextConfig } from "next";

const nextConfig: NextConfig = {
  async rewrites() {
    return [
      {
        // /sitemap-0.xml ... /sitemap-64.xml -> the /sitemaps/[id] route handler
        source: "/sitemap-:id(\\d+).xml",
        destination: "/sitemaps/:id",
      },
    ];
  },
};

export default nextConfig;

The underlying route handler stays at /sitemaps/[id]. The rewrite exposes it at /sitemap-{id}.xml. Google sees a root-level resource. The sitemap index references the root-level paths. Scope satisfied.

Two other fixes went in alongside:

Stable lastmod timestamps. Every URL was getting new Date() at render time — each request producing a different timestamp. Google ignores lastmod when it's clearly not real. Replaced with a module-level BUILD_TIME constant: one timestamp per deploy, consistent across all entries. Blog articles keep their real dateModified values.
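The change is trivial but easy to get wrong: the timestamp has to be computed once at module load, not per request. A sketch of the shape (the constant and helper names are mine; the real entry-building code differs):

```typescript
// Evaluated once when the module loads — one timestamp per deploy,
// shared by every sitemap entry, instead of a fresh new Date() per request.
export const BUILD_TIME = new Date().toISOString();

type SitemapEntry = { loc: string; lastmod: string };

export function entryFor(path: string, modified?: string): SitemapEntry {
  return {
    loc: `https://sixlines.online${path}`,
    // Blog articles pass their real dateModified; everything else
    // falls back to the stable build timestamp.
    lastmod: modified ?? BUILD_TIME,
  };
}
```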

<lastmod> in the sitemap index. The index's <sitemap> entries had <loc> but no <lastmod>. Google deprioritizes sub-sitemaps without modification dates. Added the build timestamp to each entry.
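For the index itself, the entries end up looking like this. A hedged sketch of an index generator (BUILD_TIME is assumed to be the per-deploy timestamp described above; the function shape is illustrative, not my exact code):

```typescript
// Builds a sitemap index whose <sitemap> entries each carry <lastmod>,
// so Google has a modification date to prioritize sub-sitemaps by.
const BUILD_TIME = new Date().toISOString();

function sitemapIndexXml(ids: number[]): string {
  const entries = ids
    .map(
      (id) => `  <sitemap>
    <loc>https://sixlines.online/sitemap-${id}.xml</loc>
    <lastmod>${BUILD_TIME}</lastmod>
  </sitemap>`,
    )
    .join("\n");
  return `<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
${entries}
</sitemapindex>`;
}
```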

What I should have checked first

After the fix deployed, I realized the diagnostic would have taken five minutes. Open any sub-sitemap URL. Look at the path. Ask: "Can a file at this path reference URLs at the root?"

Sitemap at: /sitemaps/0
Contains:   /en/hexagrams/1

Is /en/hexagrams/1 under /sitemaps/?  No.

That's the entire debugging session. But I never thought to ask the question because every signal — GSC status, sitemap validation tools, SEO audit results — said the sitemap was fine. The protocol violation is invisible to every tool except Google's own crawler, and Google doesn't tell you about it.
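The check is also scriptable. A rough diagnostic that pulls the <loc> entries out of a sitemap body and flags any that fall outside the scope of the path it's served from (regex parsing is crude but fine for a one-off check; the URLs are examples):

```typescript
// Returns every <loc> in a sitemap body that falls outside the scope
// of the URL the sitemap is served from.
function outOfScopeUrls(sitemapUrl: string, xml: string): string[] {
  const scope = sitemapUrl.slice(0, sitemapUrl.lastIndexOf("/") + 1);
  const locs = [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map((m) => m[1]);
  return locs.filter((loc) => !loc.startsWith(scope));
}

const xml = `<urlset>
  <url><loc>https://sixlines.online/en/hexagrams/1</loc></url>
  <url><loc>https://sixlines.online/sitemaps/extra</loc></url>
</urlset>`;

outOfScopeUrls("https://sixlines.online/sitemaps/0", xml);
// -> ["https://sixlines.online/en/hexagrams/1"]
```

Anything that function returns is a URL Google will silently drop.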

The broader pattern

This is the same class of bug I found with Pagefind on 8-Bit Oracle. In both cases:

  • The site looked correct to humans using a browser
  • Every automated validation tool said everything was fine
  • The actual consumer of the output (Google's crawler, Pagefind's indexer) silently ignored the content
  • The failure mode was quiet omission, not an error

With Pagefind, the fix was making SSG output contain actual HTML instead of empty divs. With sitemaps, the fix was serving them from a path that's allowed to reference the URLs they contain. Both bugs persisted for weeks because the feedback loop — submit sitemap, wait, check coverage — is slow enough that you blame other things first.

If you're splitting sitemaps into sub-files, check where they're served from. If the path isn't at or above every URL they reference, Google will ignore them. And it will tell you everything is fine while doing so.

Update: rewrites don't work either

The rewrite fix worked on sixlines.online — Google eventually re-crawled and discovered the URLs. But when I applied the same approach to 8bitoracle.ai (the brand site), it failed completely.

Google Search Console said "Sitemap index processed successfully" — it could read the index at /sitemap.xml. But every child sitemap showed "Couldn't fetch" with 0 discovered URLs. All 65 sub-sitemaps, invisible.

From curl, everything looked perfect:

$ curl -s -o /dev/null -w "%{http_code}" https://8bitoracle.ai/sitemap-0.xml
200
$ curl -s -D - https://8bitoracle.ai/sitemap-0.xml | head -5
HTTP/2 200
content-type: application/xml
x-matched-path: /sitemaps/0
x-nextjs-rewritten-path: /sitemaps/0

Valid XML. Correct Content-Type. HTTP 200. Even with User-Agent: Googlebot. But Google's sitemap crawler couldn't fetch any of them.

This is Next.js issue #75836. Vercel rewrites are supposed to be transparent — they're server-side URL mappings, not redirects. The client should never know the difference. But Google's sitemap fetcher does know, or at least behaves as if it does.

I don't have a confirmed explanation for why. The Vary: rsc, next-router-state-tree, next-router-prefetch header on the response is suspicious — it tells caches to serve different content based on Next.js-specific headers that Googlebot wouldn't send. But that should result in Google getting the default (non-RSC) response, which is the correct XML. The failure is empirical, not theoretical.

The actual fix

Stop using rewrites. Generate static XML files.

// scripts/generate-sitemaps.ts — runs before `next build`
import fs from "node:fs";

const ids = getSitemapIds(); // [0, 1, 2, ..., 64]

// Sitemap index at the root, referencing root-level sub-sitemaps.
fs.writeFileSync("public/sitemap.xml", sitemapIndexXml(ids));

// One static file per sub-sitemap: public/sitemap-0.xml ... sitemap-64.xml
for (const id of ids) {
  const entries = generateSitemapEntries(id);
  fs.writeFileSync(`public/sitemap-${id}.xml`, entriesToXml(entries));
}

Add to package.json:

"prebuild": "tsx scripts/generate-sitemaps.ts",
"build": "next build",

Static files in public/ are served directly by Vercel with zero processing — no rewrites, no route handlers, no edge functions, no Vary headers. Google gets a plain file.

After deploying the static file approach on 8bitoracle.ai:

Sitemap: https://8bitoracle.ai/sitemap.xml
Status: Success
Discovered pages: 16,196

16,196 URLs discovered. Immediately. The rewrite approach had been live for weeks with zero.

The lesson, again

Same pattern as before: the fix that works in every testing scenario fails silently in production. curl says 200. The browser renders XML. Sitemap validators say it's valid. The only system that can't fetch it is the one that matters — Google's crawler — and it doesn't tell you why.

If you're serving sitemaps through Next.js/Vercel rewrites and GSC says "Couldn't fetch," try static files. The 30-line prebuild script is less elegant than a rewrite, but it works.


The middleware was already correct, for once. Its matcher excluded both the /sitemaps path (as a literal) and any filename containing a dot (which covers *.xml), so the rewritten paths pass through without locale redirects. Sometimes you get lucky.

This post was written with the assistance of Claude (Anthropic). The author provided editorial direction, project context, and fact-checked all claims. The AI assisted with drafting and research.
Augustin Chan is CTO & Founder of Digital Rain Technologies, building production AI systems including 8-Bit Oracle. Previously Development Architect at Informatica and Senior Consultant at Dun & Bradstreet. BS Cognitive Science (Computation), UC San Diego.