Exclude heading from toc globally

tperry-r7 commented 4 years ago

What is the problem?

I want to exclude the heading "# On this page" from my workspace toc globally. I reviewed the instructions and it seems it might be possible. I'm not able to figure out the right syntax.

How can I reproduce it?

"markdown.extension.toc.omittedFromToc": {
    "/Users/{users}/Documents/{documentation-folder}":[
        "# On this page"
    ]
},

Is there any error message in the console?

Proposals

[ ] Add wildcard support to the omittedFromToc configuration (both file names and headings #1388).
[ ] Add an option to exclude headings before the position of the TOC.

yzhang-gh commented 4 years ago

You need to specify the file name, not the folder.

If your documents have a similar structure, consider setting toc.levels at the workspace level.

tperry-r7 commented 4 years ago

Hmm, not quite what I am looking for.

I'm using this as part of managing a large amount of documentation.

A typical file will look like below, with each file having a different name. I don't want to have to set the ignore per file. Is there a way to set it at the workspace level so I don't need to add every single file name?

# Title

## On this page

* [Toc](link) 
* [Toc](link) 
* [Toc](link) 

## Heading
some text

## Heading
some text

### Heading 
some text

yzhang-gh commented 4 years ago

Currently, we don't provide such a configuration. As a workaround, you can use ### On this page and set toc.levels to 2..6. (The TOC will start with the first level-2 heading.)
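A minimal sketch of what this workaround amounts to, assuming the behaviour is exactly as described in the comment (headings outside the range are dropped, and the TOC starts at the first heading of the minimum level). The names `parseLevelRange` and `tocCandidates` are hypothetical and are not taken from the extension's source.

```typescript
interface Heading {
    level: number;
    text: string;
}

// Hypothetical helper: parse a range such as "2..6" into [min, max].
function parseLevelRange(range: string): [number, number] {
    const match = /^([1-6])\.\.([1-6])$/.exec(range);
    if (match === null) {
        throw new Error(`Invalid level range: ${range}`);
    }
    return [Number(match[1]), Number(match[2])];
}

// Model of the described behaviour: drop headings outside the range, then
// start the TOC at the first heading of the minimum level.
// Assumption: if no heading of the minimum level exists, keep all in-range headings.
function tocCandidates(headings: Heading[], range: string): Heading[] {
    const [min, max] = parseLevelRange(range);
    const inRange = headings.filter((h) => h.level >= min && h.level <= max);
    const start = inRange.findIndex((h) => h.level === min);
    return start === -1 ? inRange : inRange.slice(start);
}

const page: Heading[] = [
    { level: 1, text: "Title" },
    { level: 3, text: "On this page" },
    { level: 2, text: "Setup" },
    { level: 2, text: "Usage" },
    { level: 3, text: "Details" },
];

// "Title" is outside 2..6, and "On this page" precedes the first level-2 heading.
const tocTexts = tocCandidates(page, "2..6").map((h) => h.text);
console.log(tocTexts); // [ 'Setup', 'Usage', 'Details' ]
```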

Thell commented 4 years ago

Since I have a similar desire to omit particular headings, perhaps either the headings that appear before the TOC could be omitted (by a flag), or a distinction could be made between setext and ATX heading styles so that one type could be omitted.

The argument for the first option is a simple question: how many times have you seen a published work where the TOC includes the cover/title, preamble, summary, etc.? Those typically reside before the TOC and aren't included.

The argument for the second option would be the intent of the usage.

Looking at the following outline, which items would most likely be found in a ToC?

Title
===
Purpose text.

## Introduction
Summary text/info blurb.

Contents
----------
[ToC]

# Header

## Header

### Header

# Header

## Header

### Header

I haven't looked at the source yet, as I got started with your extension less than an hour ago, ran into this problem, and found this to be the first issue in the tracker. Still, I wanted to give my feedback on the first feature of your extension that I attempted to use.

Perhaps it would be an easy alteration to the code?

edit: Upon re-reading the original post I'm thinking that perhaps this isn't the right spot for these comments?

Thell commented 4 years ago

It looks like you already handle setext headings specially, which makes it possible to alter the if block:

https://github.com/yzhang-gh/vscode-markdown/blob/bbada83b39a6ef862e59da0696847d953757ee23/src/toc.ts#L351

to include something along the lines of && !Omit.setextStyle ?

yzhang-gh commented 4 years ago

to omit particular headings perhaps either the heading existing prior to the TOC could be omitted (by a flag) or a distinction could be made between setext and ATX heading styles where one type could be omitted.

Personally, I don't think a distinction between setext and ATX headings is a good solution (in terms of how many users will use this special rule). To omit headings before [TOC] makes more sense to me.
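A minimal sketch of the "omit headings before [TOC]" idea, assuming the document has a single TOC whose line number is already known. The `Heading` shape and the helper `headingsAfterToc` are hypothetical and only illustrate the filtering step, not how the extension locates a TOC.

```typescript
interface Heading {
    level: number;
    text: string;
    line: number; // zero-based line where the heading starts
}

// Hypothetical helper: keep only the headings that appear after the TOC marker.
// Assumption: `tocLine` is the zero-based line of the single [TOC] marker.
function headingsAfterToc(headings: Heading[], tocLine: number): Heading[] {
    return headings.filter((h) => h.line > tocLine);
}

// Roughly the outline from the earlier comment; line numbers are illustrative.
const outline: Heading[] = [
    { level: 1, text: "Title", line: 0 },
    { level: 2, text: "Introduction", line: 4 },
    { level: 2, text: "Contents", line: 7 },
    { level: 1, text: "Header", line: 11 },
    { level: 2, text: "Header", line: 13 },
    { level: 3, text: "Header", line: 15 },
];

// With the marker on line 9, "Title", "Introduction", and "Contents" are dropped.
const kept = headingsAfterToc(outline, 9);
console.log(kept.map((h) => h.level)); // [ 1, 2, 3 ]
```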

Upon re-reading the original post I'm thinking that perhaps this isn't the right spot for these comments?

This issue is about the toc.omittedFromToc (global) option while you are talking about omitting headings in general. They are different things but are both solutions to a general question.

BTW, I understand people may have different ideas about how to organize their Markdown file structures, e.g. blog posts. Just curious why you use level-1 headings after "Contents". You are putting them at the same level as "Title" (in other words, they have visually the same font size/weight etc.).

(See the README source file of this repository for a different section structure.)

Thell commented 4 years ago

To omit headings before [TOC] makes more sense to me.

Indeed, and I agree, although I can see a case for being able to treat the two (setext and ATX) distinctly, particularly since they are visually distinct when viewing a Markdown document in its native format as a text document. Just having the ability to omit pre-TOC headers would be a 👍

Regarding the example: it was meant more for illustration as to what one would anticipate seeing in a ToC given that textual description. One thing I try to avoid in markdown documents is markup. To me  or raw HTML(5) tags really hinder reading the document itself without a browser even though it makes sense for a README that is intended to be rendered on github and not in a console.

yzhang-gh commented 4 years ago

I see your point. Let's just collect more feedback and then see how we can improve it.

hpractv commented 3 years ago

I think the easiest change to understand would be to make the path wildcard enabled:

"markdown.extension.toc.omittedFromToc": {
    "*/*.md": [
        "# index"
    ]
},

Also, as an aside, it would be nice if the string being ignored could be done in a case insensitive way. Otherwise, you'd have to add "# Index" and "# INDEX" and hope that "iNdex" isn't used.
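A minimal sketch of what such matching could look like. The glob semantics here are an assumption, not a standard: `*` matches within one path segment, `**` matches across segments, and every other character (including `?`) is literal. The helpers `globToRegExp` and `sameHeadingIgnoringCase` are hypothetical, and the case-insensitive comparison uses plain `toLowerCase()`, which is only one of several possible folding strategies.

```typescript
// Convert a simple glob into an anchored regular expression.
// Assumed semantics: "*" stays inside one path segment, "**" crosses segments.
function globToRegExp(glob: string): RegExp {
    const source = glob
        // Escape regex metacharacters, leaving "*" for the next step.
        .replace(/[.+^${}()|[\]\\?]/g, "\\$&")
        .replace(/\*\*|\*/g, (m) => (m === "**" ? ".*" : "[^/]*"));
    return new RegExp(`^${source}$`, "u");
}

// Simple case-insensitive heading comparison based on toLowerCase().
function sameHeadingIgnoringCase(a: string, b: string): boolean {
    return a.toLowerCase() === b.toLowerCase();
}

const oneLevel = globToRegExp("*/*.md");
const anyDepth = globToRegExp("**");

console.log(oneLevel.test("docs/index.md")); // true
console.log(oneLevel.test("index.md")); // false: no directory segment
console.log(oneLevel.test("a/b/index.md")); // false: "*" does not cross "/"
console.log(anyDepth.test("a/b/index.md")); // true
console.log(sameHeadingIgnoringCase("# Index", "# iNdex")); // true
```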

xAlien95 commented 3 years ago

To omit headings before [TOC] makes more sense to me.

@yzhang-gh, this is exactly what I'm looking for. Is it currently available as a feature?

yzhang-gh commented 3 years ago

@hpractv @xAlien95 Thanks for the feedback. I've updated the issue description.

IMHO, they are both valid feature requests/enhancements. Let's see whether there is someone willing to help.

davorpa commented 3 years ago

I think the easiest change to understand would be to make the path wildcard enabled:
"markdown.extension.toc.omittedFromToc": {
    "*/*.md": [
        "# index"
    ]
},
Also, as an aside, it would be nice if the string being ignored could be done in a case insensitive way. Otherwise, you'd have to add "# Index" and "# INDEX" and hope that "iNdex" isn't used.

I agree. Using a case-insensitive glob instead of an exact file match is a wonderful idea. It would be great for both file names and headings. For all files:

  "markdown.extension.toc.omittedFromToc": {
    "**": ["# Index", "## Index", "### Index"]
  },

In the code:

https://github.com/yzhang-gh/vscode-markdown/blob/65f0fa38d0cdf0b8b67cc461ee444d4c91143df9/src/toc.ts#L221-L233

innocenzi commented 2 years ago

I think the easiest change to understand would be to make the path wildcard enabled:
"markdown.extension.toc.omittedFromToc": {
    "*/*.md": [
        "# index"
    ]
},
Also, as an aside, it would be nice if the string being ignored could be done in a case insensitive way. Otherwise, you'd have to add "# Index" and "# INDEX" and hope that "iNdex" isn't used.

Looks exactly like what I need.

Lemmingh commented 2 years ago

This thread has two distinct topics now. Be careful! Both are indeed very complex.

Let's evaluate them one by one.

Allow wildcard in the "exclude-heading" setting

Anyway, the toc.omittedFromToc setting (#580) is severely flawed, and cannot work on virtual file systems. Thus, we have to remove it someday, and get a totally new one.

If I were able to design the setting from scratch, I would also add wildcard support to the path filter.

But I strongly disagree with glob patterns. Glob has a long history, and appears nice at first glance. However, it has too many variants, and cannot briefly express fine-grained control. If you are still not aware of the scary headache, take a look at the configuration files inside this repository, and you can find at least five different glob systems. They differ slightly, mainly in semantics.

I prefer something with a rigorous and evergreen spec, for example, the ECMAScript regular expression. Although ES regular expressions are difficult to get started with, they ensure a firm consensus over time.

Another interesting problem in this topic is how to match headings. I prefer ordinal comparison on the raw content for security and efficiency.

Why just ordinal comparison? In case you are not familiar with case-insensitive matching for Unicode strings, there are basically 3 algorithms (ASCII, Unicode full, language-sensitive), and usually 5 in practice (change folding sets). If we also take the Unicode normalization forms (NF) into consideration, there are 20 options. Evidently, compared with case-insensitive comparison, ordinal comparison is safe, reliable, and fast.
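A small illustration of why the choice of algorithm matters, as a sketch only. The helper names `ordinalEquals`, `lowerEquals`, and `upperEquals` are hypothetical; they use the built-in `toLowerCase()`, `toUpperCase()`, and `normalize()` string methods and do not cover every folding variant mentioned above.

```typescript
// Ordinal comparison: both strings must consist of exactly the same code units.
function ordinalEquals(a: string, b: string): boolean {
    return a === b;
}

// Two naive case-insensitive comparisons that look interchangeable but are not.
function lowerEquals(a: string, b: string): boolean {
    return a.toLowerCase() === b.toLowerCase();
}

function upperEquals(a: string, b: string): boolean {
    return a.toUpperCase() === b.toUpperCase();
}

// "ß" upper-cases to "SS", so the two strategies disagree on this pair.
const viaLower = lowerEquals("Straße", "STRASSE");
const viaUpper = upperEquals("Straße", "STRASSE");
console.log(viaLower, viaUpper); // false true

// Normalization adds another axis: the same visible text, different code units.
const composed = "\u00e9"; // "é" as a single code point
const decomposed = "e\u0301"; // "e" followed by a combining acute accent
const rawEqual = ordinalEquals(composed, decomposed);
const nfcEqual = ordinalEquals(composed.normalize("NFC"), decomposed.normalize("NFC"));
console.log(rawEqual, nfcEqual); // false true
```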

Why just raw content? CommonMark defines ATX headings and setext headings, and how to extract their raw content. Again, reliability. Checking the "rendered" text would be overkill and too slow.

interface _ {
    /**
     * The path filter.
     */
    path: string;

    /**
     * The heading level.
     */
    level: 1 | 2 | 3 | 4 | 5 | 6;

    /**
     * The raw content of the heading according to the CommonMark Spec.
     * Can be multiline.
     */
    content: string;
}

Exclude all headings before TOC

As I noted in #931, it is not possible with the current architecture.

The current system can be regarded as "natural recognition", where we detect TOCs by statistical characteristics, as described in the v1 README. Multiple TOCs are also allowed. For example, you may want to put one TOC at the beginning of the document, and the other at the end. (#360)

If we want to "exclude all headings before TOC", we'll have to introduce a "directive-based recognition" system to make the TOC range deterministic, and allow only a single TOC at the same time. This requires a major design, and I'm not going to have an in-depth discussion here.

Lemmingh commented 2 years ago

A sketch of the "exclude-heading" setting I talked about above:

type HeadingLevel = 1 | 2 | 3 | 4 | 5 | 6;

/**
 * @param level - The heading level.
 * @param content - The raw content of the heading according to the CommonMark Spec. Can be multiline.
 */
type HeadingSelector = [level: HeadingLevel, content: string];

/**
 * The shape in `settings.json`.
 */
interface UserSetting {
    [path: string]: HeadingSelector[];
}

interface InternalSetting {
    /**
     * The path filter.
     */
    path: RegExp;

    selectors: HeadingSelector[];
}

/**
 * The setting is managed as a Map at runtime.
 */
type InternalRegistry = Map<string, InternalSetting>;

const config: UserSetting = {
    "^docs/demo/emoji-[^/]+\\.md$": [[2, "😜 TOC"]],
    "^docs/.*\\p{Script=Han}.*\\.md$": [[2, "目录"]],
};

const s: InternalRegistry = new Map<string, InternalSetting>(
    Array.from(Object.entries(config), ([p, selectors]): [string, InternalSetting] => {
        const path = new RegExp(p, "u");
        return [path.source, { path, selectors }];
    })
);

davorpa commented 2 years ago

A sketch of the "exclude-heading" setting I talked about above:

type HeadingLevel = 1 | 2 | 3 | 4 | 5 | 6;

/**
 * @param level - The heading level.
 * @param content - The raw content of the heading according to the CommonMark Spec. Can be multiline.
 */
type HeadingSelector = [level: HeadingLevel, content: string];

/**
 * The shape in `settings.json`.
 */
interface UserSetting {
    [path: string]: HeadingSelector[];
}
...

Nice draft!!

Could the text also optionally be a RegExp?

type RegExpOrString = RegExp | string;

type HeadingSelector = [level: HeadingLevel, content: RegExpOrString];

interface InternalSetting {
    path: RegExpOrString;
    selectors: HeadingSelector[];
}

Sometimes other engines are used, and headings have attributes or HTML formatting tags.

### <a name="heading-3-anchor-alias"></a>Heading 3th level{#heading-3 .red}

For parsing the attribute syntax, see markdown-it-attrs.

Also, glob and RegExp are not mutually exclusive. The kind could be marked by a starting character, and the config parsing could then be handled with some GoF patterns: strategy factory, chain of responsibility...

E.g. in the config, a string starting with ^ means RegExp; if not, it might be a glob pattern or merely a string.
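A minimal sketch of that prefix convention, assuming only two strategies: a spec starting with `^` is compiled as a regular expression, and anything else is compared as a literal path. Glob handling is left out on purpose. The names `PathFilter` and `compilePathFilter` are hypothetical.

```typescript
type PathFilter = (docPath: string) => boolean;

// Pick a matching strategy from the first character of the spec.
// Assumption: "^" marks a regular expression; everything else is a literal path.
function compilePathFilter(spec: string): PathFilter {
    if (spec.startsWith("^")) {
        const re = new RegExp(spec, "u");
        return (docPath) => re.test(docPath);
    }
    return (docPath) => docPath === spec;
}

const byRegExp = compilePathFilter("^docs/.*\\.md$");
const byLiteral = compilePathFilter("README.md");

console.log(byRegExp("docs/guide/intro.md")); // true
console.log(byLiteral("README.md")); // true
console.log(byLiteral("docs/README.md")); // false: literal specs must match exactly
```

One consequence of this convention is that a file whose name really starts with `^` can no longer be written as a literal spec.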

Lemmingh commented 2 years ago

Could the text also optionally be a RegExp?

E.g. in the config, a string starting with ^ means RegExp; if not, it might be a glob pattern or merely a string.

Sounds like Path and LiteralPath. But I think it's neither necessary nor practical.

Assuming that your workspace on Windows is

C:\Projects\docs

To exactly match the file

C:\Projects\docs\^! #$%&'()+,-;=@[]_`{}\~.md

You just need

let p = "^\\^! #\\$%&'\\(\\)\\+,-;=@\\[\\]_`\\{\\}/~\\.md$";

new RegExp(p, "u").test("^! #$%&'()+,-;=@[]_`{}/~.md"); // true

RegExp can also express those strange file names on Unix-like systems without too much effort.

Sometimes other engines are used, and headings have attributes or HTML formatting tags.

To get the "rendered" text is not just parsing. You have to really render it. Even WHATWG's innerText algorithm is too heavy.

So, I said "overkill".

Well, I'm waiting for someone to ask a question:

const config: UserSetting = {
    "^docs/demo/emoji-[^/]+\\.md$": [[2, "😜 TOC"]],
    "^docs/.*\\p{Script=Han}.*\\.md$": [[2, "目录"]],
};

What on earth does this mean? What if we have

docs/demo/emoji-大笑.md
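A quick check of that path against the two patterns, as a minimal sketch: both regular expressions match it, so with a setting keyed by pattern it is unclear which entry should apply. Reading the question this way is an assumption, and the variable names are illustrative.

```typescript
// The two path filters from the config above, as they appear in settings.json.
const patterns: string[] = [
    "^docs/demo/emoji-[^/]+\\.md$",
    "^docs/.*\\p{Script=Han}.*\\.md$",
];

const ambiguousPath = "docs/demo/emoji-大笑.md";

// Compile each filter with the "u" flag (required for \p{...}) and test the path.
const matching = patterns.filter((p) => new RegExp(p, "u").test(ambiguousPath));

console.log(matching.length); // 2: the path satisfies both filters
```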

Here is another design somewhat modeled after Vue Router. I like it a bit more because of its clean JSON schema, type definition, and logic.

type HeadingLevel = 1 | 2 | 3 | 4 | 5 | 6;

type HeadingSelector = [level: HeadingLevel, content: string];

interface MatchRule<TPath extends string | RegExp> {
    /**
     * The path filter. (regular expression)
     */
    path: TPath;

    headings: HeadingSelector[];
}

type UserRule = MatchRule<string>;

type InternalRule = MatchRule<RegExp>;

/**
 * The shape in `settings.json`.
 */
type UserSetting = UserRule[];

/**
 * The setting at runtime.
 */
type InternalSetting = InternalRule[];

const config: UserSetting = [
    { path: "^docs/demo/emoji-[^/]+\\.md$", headings: [[2, "😜 TOC"]] },
    { path: "^docs/.*\\p{Script=Han}.*\\.md$", headings: [[2, "目录"]] },
];

const s: InternalSetting = Array.from(
    config,
    ({ path: p, headings }): InternalRule => ({ path: new RegExp(p, "u"), headings })
);

/**
 * Normalized relative path.
 */
declare const docPath: string;

for (const rule of s) {
    if (rule.path.test(docPath)) {
        // ...
        break; // Only the first matching rule takes effect.
    }
}
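The loop above can be wrapped into a small helper to show the first-match behaviour end to end. This is a sketch under the same type definitions; the function name `isOmitted` is hypothetical, and headings are compared ordinally on level and raw content as suggested earlier in the thread.

```typescript
type HeadingLevel = 1 | 2 | 3 | 4 | 5 | 6;
type HeadingSelector = [level: HeadingLevel, content: string];

interface InternalRule {
    path: RegExp;
    headings: HeadingSelector[];
}

// Only the first rule whose path filter matches the document is consulted.
function isOmitted(
    rules: InternalRule[],
    docPath: string,
    level: HeadingLevel,
    content: string
): boolean {
    const rule = rules.find((r) => r.path.test(docPath));
    if (rule === undefined) {
        return false;
    }
    return rule.headings.some(([l, c]) => l === level && c === content);
}

const rules: InternalRule[] = [
    { path: new RegExp("^docs/demo/emoji-[^/]+\\.md$", "u"), headings: [[2, "😜 TOC"]] },
    { path: new RegExp("^docs/.*\\p{Script=Han}.*\\.md$", "u"), headings: [[2, "目录"]] },
];

// This path matches both filters, but only the first rule takes effect.
console.log(isOmitted(rules, "docs/demo/emoji-大笑.md", 2, "😜 TOC")); // true
console.log(isOmitted(rules, "docs/demo/emoji-大笑.md", 2, "目录")); // false
console.log(isOmitted(rules, "docs/中文.md", 2, "目录")); // true
console.log(isOmitted(rules, "README.md", 2, "目录")); // false: no rule matches
```

Because the rules form an ordered array, the ambiguity from the keyed design is resolved by position: whichever rule is listed first wins.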

davorpa commented 2 years ago

const s: InternalSetting = Array.from(
    config,
    ({ path: p, headings }): InternalRule => ({ path: new RegExp(p, "u"), headings })
);

The Vue Router-style design seems clearer to me, using Array.from's mapping function. It is a nice way to separate the API from the internal config.

rishabhbhatt009 commented 1 year ago

As mentioned before, there are 2 different topics in this thread:

Topic 1: Allow wildcard in the exclude-heading (globally)
Topic 2: Exclude all headings before TOC

However, I believe that there is only one topic as the other one may have been misinterpreted.

<!-- common .md structure -->
# Page Title 

## Table Of Content
- [Page Title](link)
   - [Table Of Content](link)
   - [Heading-1](link)
   - [Heading-2](link)

## Heading-1 
. . .

## Heading-2 
. . .

There are 2 main problems with this TOC (under the default setting "markdown.extension.toc.levels": "1..6"):

Inclusion of Page Title (Topic-2)
Inclusion of Table of Content in TOC (Topic-1)

I believe Topic-2 arises due to the inclusion of headings like # Page Title. As @Lemmingh correctly pointed out:

you may want to put one TOC at the beginning of the document, and the other at the end.

In most cases like this, people do not want to exclude all headings above the TOC, only higher-level headings such as the page title. The intended purpose can easily be achieved by changing the setting to "markdown.extension.toc.levels": "2..6".

Topic-1 arises due to the inclusion of headings such as ## Table of Content itself. This is a generic heading across all *.md files and must be omitted (at least within the project workspace, once set) without specifying each file name. We require a solution along the lines of:

{
  "omittedInToc": {
    "*.md": [
      "## Table Of Content"
    ],
    "docs/my_doc.md": [
      "# Introduction",
      "## Summary"
    ]
  }
}

pasquale95 commented 7 months ago

Hi, has the solution with wildcards been implemented?

Handled1 commented 1 month ago

This would be a great feature, and something I'm having to handle now. Any idea if this will go ahead?
