webmeshproj / webmesh

A simple, distributed, zero-configuration WireGuard mesh solution
https://webmeshproj.github.io
Apache License 2.0

mtls: Global options are not properly writing all the needed TLS configurations #17

Closed 80imike closed 1 year ago

80imike commented 1 year ago

After configuring the node according to the guide at https://webmeshproj.github.io/guides/personal-vpn/, starting the webmesh node reports the following error:

$ /root/tools/webmesh-node --global.disable-ipv6 \
    --global.detect-endpoints \
    --global.detect-private-endpoints \
    --global.mtls \
    --global.tls-ca-file=$(pwd)/pki/nodes/bootstrap/ca.crt \
    --global.tls-cert-file=$(pwd)/pki/nodes/bootstrap/tls.crt \
    --global.tls-key-file=$(pwd)/pki/nodes/bootstrap/tls.key \
    --global.verify-chain-only \
    --bootstrap.enabled
Error: invalid global options: mtls is enabled but no tls-ca-file is set

$ tree
.
└── pki
    ├── ca
    │   ├── ca.crt
    │   ├── tls.crt
    │   └── tls.key
    └── nodes
        ├── admin
        │   ├── ca.crt
        │   ├── tls.crt
        │   └── tls.key
        ├── bootstrap
        │   ├── ca.crt
        │   ├── tls.crt
        │   └── tls.key
        └── node
            ├── ca.crt
            ├── tls.crt
            └── tls.key

However, --global.tls-ca-file=$(pwd)/pki/nodes/bootstrap/ca.crt is set correctly and the file exists. Why does this error still appear?

/root/tools/webmesh-node --version

Webmesh Node Version: 0.11.4
Commit: a40a85bd7dfd01e734f4c286aa068fad0b1ae485
Build Date: 2023-10-07T07:26:16Z

tinyzimmer commented 1 year ago

Wow good catch. I'm not sure how long that's been there, but I have a conditional wrong in the Global options validator. Will push a fix soon.

tinyzimmer commented 1 year ago

Once v0.11.5 is done building, that should fix your issue. The problem is that it was wanting you to also pass --global.tls-client-ca-file unnecessarily. You can confirm that on your current version by setting both flags to the same file.

Otherwise updating to the latest version should address the problem.
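
The validation bug described above can be pictured with a small sketch. This is a hypothetical Python reconstruction, not the project's Go code: the `GlobalOptions` class, both function names, and the exact shape of the faulty conditional are assumptions, made only to illustrate why also passing `--global.tls-client-ca-file` worked around the error.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GlobalOptions:
    """Hypothetical stand-in for the node's global TLS options."""
    mtls: bool = False
    tls_ca_file: str = ""
    tls_client_ca_file: str = ""


ERR_NO_CA = "mtls is enabled but no tls-ca-file is set"


def validate_buggy(opts: GlobalOptions) -> Optional[str]:
    # Assumed shape of the faulty check: it also demands the client CA file,
    # yet reports the tls-ca-file message in either case.
    if opts.mtls and (not opts.tls_ca_file or not opts.tls_client_ca_file):
        return ERR_NO_CA
    return None


def validate_fixed(opts: GlobalOptions) -> Optional[str]:
    # Intended check: mTLS only requires the CA file itself.
    if opts.mtls and not opts.tls_ca_file:
        return ERR_NO_CA
    return None


# The reporter's invocation: CA file set, client CA file not set.
reporter = GlobalOptions(mtls=True, tls_ca_file="pki/nodes/bootstrap/ca.crt")
# The suggested workaround: point both options at the same file.
workaround = GlobalOptions(mtls=True, tls_ca_file="ca.crt", tls_client_ca_file="ca.crt")

print(validate_buggy(reporter))
print(validate_buggy(workaround))
print(validate_fixed(reporter))
```

Under these assumptions the reporter's options fail only in the buggy variant, which matches the confusing message about a flag that was in fact set.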

80imike commented 1 year ago

I have tried several versions, and every one from 0.11.2 through 0.11.4 has this problem.

tinyzimmer commented 1 year ago

Yea it probably popped up after one of the minor bumps. Each one of those has involved a core-refactor in some way. In coming releases I intend to have full e2e tests that go through all the example topologies and things like TLS.

80imike commented 1 year ago

I just tried 0.11.5 and the same problem exists:

[three screenshots not preserved]

tinyzimmer commented 1 year ago

New error. Damn how long ago did I break this :laughing:

Will need to dig in.

tinyzimmer commented 1 year ago

I'm in the middle of reworking the storage API, but before I do the next release I'll make sure to have whatever this is resolved. Will keep this issue posted.

80imike commented 1 year ago

I tested older releases, and the last working version is 0.3.0:

[screenshot not preserved]

tinyzimmer commented 1 year ago

Wow I admire your commitment. And am also ashamed. What might still be working on the latest versions is manually supplying every required field. What happened around 0.3.0 is that I ripped out my custom flag/config/env parsing in favor of using koanf. This must have slipped through.

tinyzimmer commented 1 year ago

I got past a big refactor I was in the middle of and will be doing a new 0.12.0 probably at some point later today. I'll be making sure this is fixed as part of that release.

tinyzimmer commented 1 year ago

I believe I have this resolved, but I'm writing out a few more tests and then going to go through the guide myself with the latest build to double check before I tag a new release.

tinyzimmer commented 1 year ago

The coming v0.12.0 should have these issues all fixed with some more test cases around them. I'll leave the issue open for a few days to make sure.

80imike commented 1 year ago

The above problem appears to be solved: the bootstrap node now starts normally.

But when I use v0.12.0 to connect to the bootstrap node, I get the following error:

webmesh-node \
    --global.mtls \
    --global.tls-cert-file=/etc/webmesh/tls/tls.crt \
    --global.tls-key-file=/etc/webmesh/tls/tls.key \
    --global.tls-ca-file=/etc/webmesh/tls/ca.crt \
    --mesh.join-address=10.0.8.17:8443
time=2023-10-09T13:46:46.592+08:00 level=INFO msg="Starting webmesh node" version=0.12.0 commit=1583de5b73608fc93dfaee89bc483bcbbe97d37d buildDate=2023-10-08T21:21:00Z
time=2023-10-09T13:46:46.603+08:00 level=INFO msg="Joining webmesh cluster" component=mesh node-id=admin
time=2023-10-09T13:46:46.605+08:00 level=ERROR msg="Join request failed" component=mesh node-id=admin error="join: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: tls: failed to verify certificate: x509: cannot validate certificate for 10.0.8.17 because it doesn't contain any IP SANs\""

This seems to be caused by the fact that there is no corresponding IP in the certificate:

openssl x509 -noout -text -in /etc/webmesh/tls/tls.crt
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 5169786674212558623 (0x47bec596e8fcd71f)
        Signature Algorithm: ecdsa-with-SHA256
        Issuer: CN = webmesh-ca
        Validity
            Not Before: Oct  8 03:50:36 2023 GMT
            Not After : Jan  6 03:50:36 2024 GMT
        Subject: CN = admin
        Subject Public Key Info:
            Public Key Algorithm: id-ecPublicKey
                Public-Key: (256 bit)
                pub:
                    04:4b:96:93:62:22:55:a1:d9:53:4d:cc:0f:58:e6:
                    d0:2d:a1:b8:e9:38:d5:b5:0f:6f:1f:a9:3b:74:bb:
                    dc:0a:d7:25:bd:64:32:9a:bd:3f:c7:f2:39:41:b3:
                    53:5b:31:9c:0e:19:48:8a:93:e2:90:21:44:71:b9:
                    97:94:81:6b:48
                ASN1 OID: prime256v1
                NIST CURVE: P-256
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature
            X509v3 Extended Key Usage:
                TLS Web Client Authentication, TLS Web Server Authentication
            X509v3 Basic Constraints: critical
                CA:FALSE
    Signature Algorithm: ecdsa-with-SHA256
         30:45:02:20:02:19:47:9f:06:2c:e1:e2:72:dd:20:07:d3:49:
         59:4a:9f:f6:24:e7:19:72:d6:39:f7:18:0e:41:96:9b:2c:68:
         02:21:00:e4:42:16:65:5c:40:ae:d6:5a:90:85:74:b8:fa:68:
         aa:37:f6:b8:a4:28:48:5f:7f:0a:b7:d7:d8:74:1b:16:42

I then switched to a hostname for the join address, after mapping it to the IP in the hosts file:

10.0.8.17 admin

That also fails, with a different error:

webmesh-node \
    --global.mtls \
    --global.tls-cert-file=/etc/webmesh/tls/tls.crt \
    --global.tls-key-file=/etc/webmesh/tls/tls.key \
    --global.tls-ca-file=/etc/webmesh/tls/ca.crt \
    --mesh.join-address=admin:8443
time=2023-10-09T13:53:46.494+08:00 level=INFO msg="Starting webmesh node" version=0.12.0 commit=1583de5b73608fc93dfaee89bc483bcbbe97d37d buildDate=2023-10-08T21:21:00Z
time=2023-10-09T13:53:46.507+08:00 level=INFO msg="Joining webmesh cluster" component=mesh node-id=admin
time=2023-10-09T13:53:46.508+08:00 level=ERROR msg="Join request failed" component=mesh node-id=admin error="join: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate is not valid for any names, but wanted to match admin\""
tinyzimmer commented 1 year ago

Yea this is the purpose of the --global.verify-chain-only flag. The built-in PKI doesn't issue proper certificates with domain names and/or IP addresses, so you have to use that flag. If you have a PKI that can issue valid SANs, then that isn't necessary.

Using the built-in one comes with some challenges. Unless you have valid domain names, the IP addresses used for communication are not deterministic. The IPv4 addresses are random (unless configured statically). The IPv6 addresses are actually derived from the public wireguard keys, but you'd need to know the public key in advance of making the certificate.
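
To make the distinction concrete: chain-only verification still checks that the peer's certificate chains to a trusted CA, but skips matching the certificate's SANs against the address being dialed. The sketch below shows the same idea with Python's standard `ssl` module. It is an analogy, not webmesh's implementation; the function name and the optional `ca_file` parameter are illustrative.

```python
import ssl
from typing import Optional


def chain_only_client_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Client context that verifies the chain but not the peer's name or IP."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # Skip SAN / hostname matching, so a certificate without IP SANs is accepted.
    ctx.check_hostname = False
    # Still require a certificate that chains to a trusted CA.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if ca_file:
        # Trust only the mesh CA (for example the ca.crt issued by the PKI).
        ctx.load_verify_locations(cafile=ca_file)
    return ctx


ctx = chain_only_client_context()
print(ctx.check_hostname, ctx.verify_mode == ssl.CERT_REQUIRED)
```

With `check_hostname` left at its default of `True`, a connection to `10.0.8.17` would fail for a certificate that carries no IP SANs, which is the class of error shown earlier in the thread.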

80imike commented 1 year ago

I am now using v0.12.1, and have added --global.verify-chain-only to both bootstrap node and node.

$ ps aux|grep webmesh
root      251842  0.7  1.4 3632036 57836 ?       Ssl  09:31   0:02 /usr/bin/webmesh-node --global.detect-endpoints --global.detect-private-endpoints --global.mtls --global.tls-cert-file=/etc/webmesh/tls/tls.crt --global.tls-key-file=/etc/webmesh/tls/tls.key --global.tls-ca-file=/etc/webmesh/tls/ca.crt --global.verify-chain-only --bootstrap.enabled --bootstrap.default-network-policy=accept

But the following error persists:

webmesh-node \
    --global.mtls \
    --global.tls-cert-file=/etc/webmesh/tls/tls.crt \
    --global.tls-key-file=/etc/webmesh/tls/tls.key \
    --global.tls-ca-file=/etc/webmesh/tls/ca.crt \
    --global.verify-chain-only \
    --mesh.join-address=10.0.8.17:8443
time=2023-10-10T09:37:02.729+08:00 level=INFO msg="Starting webmesh node" version=0.12.1 commit=37e78d7de2277e9d799b6fe705907e17bdad3891 buildDate=2023-10-09T22:16:34Z
time=2023-10-10T09:37:02.741+08:00 level=INFO msg="Joining webmesh cluster" component=mesh node-id=bootstrap-node
time=2023-10-10T09:37:02.743+08:00 level=ERROR msg="Join request failed" component=mesh node-id=bootstrap-node error="join: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: tls: failed to verify certificate: x509: cannot validate certificate for 10.0.8.17 because it doesn't contain any IP SANs\""
time=2023-10-10T09:37:03.743+08:00 level=INFO msg="Retrying join request" component=mesh node-id=bootstrap-node tries=1
time=2023-10-10T09:37:03.745+08:00 level=ERROR msg="Join request failed" component=mesh node-id=bootstrap-node error="join: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: tls: failed to verify certificate: x509: cannot validate certificate for 10.0.8.17 because it doesn't contain any IP SANs\""
time=2023-10-10T09:37:04.745+08:00 level=INFO msg="Retrying join request" component=mesh node-id=bootstrap-node tries=2
tinyzimmer commented 1 year ago

Hrm. This definitely worked in my local test so I'll need to dig deeper. You also have --global.insecure-skip-verify which ignores client-side TLS verification entirely but still does the mTLS exchange (I know that sounds weird). I'll definitely get to the bottom of whatever the issue is here though.

80imike commented 1 year ago

After adding --global.insecure-skip-verify, the problem still exists:

ps aux|grep webmesh
root       30388  0.6  1.4 3632036 57848 ?       Ssl  14:11   0:03 /usr/bin/webmesh-node --global.detect-endpoints --global.detect-private-endpoints --global.mtls --global.tls-cert-file=/etc/webmesh/tls/tls.crt --global.tls-key-file=/etc/webmesh/tls/tls.key --global.tls-ca-file=/etc/webmesh/tls/ca.crt --global.verify-chain-only --global.insecure-skip-verify --bootstrap.enabled --bootstrap.default-network-policy=accept
 webmesh-node \
    --global.mtls \
    --global.tls-cert-file=/etc/webmesh/tls/tls.crt \
    --global.tls-key-file=/etc/webmesh/tls/tls.key \
    --global.tls-ca-file=/etc/webmesh/tls/ca.crt \
    --global.verify-chain-only \
    --mesh.join-address=10.0.8.17:8443  --global.insecure-skip-verify
time=2023-10-10T14:18:55.738+08:00 level=INFO msg="Starting webmesh node" version=0.12.1 commit=37e78d7de2277e9d799b6fe705907e17bdad3891 buildDate=2023-10-09T22:16:34Z
time=2023-10-10T14:18:55.748+08:00 level=WARN msg="InsecureSkipVerify is enabled, skipping TLS verification"
time=2023-10-10T14:18:55.749+08:00 level=INFO msg="Joining webmesh cluster" component=mesh node-id=node
time=2023-10-10T14:18:55.751+08:00 level=ERROR msg="Join request failed" component=mesh node-id=node error="join: rpc error: code = Unavailable desc = connection error: desc = \"error reading server preface: remote error: tls: unknown certificate authority\""

Maybe there is something wrong with the certificate generated by wmctl pki?

tinyzimmer commented 1 year ago

Perhaps? Try with a fresh one if you haven't. In my local test I used that PKI.

tinyzimmer commented 1 year ago

Ok I've managed to reproduce these issues with a fresh PKI on my machine. So progress.

80imike commented 1 year ago

This is how I generated the certificate:

wmctl pki init --pki-directory pki
wmctl pki --pki-directory ./pki issue --name bootstrap
wmctl pki --pki-directory ./pki issue --name node

Then I copied each node's certificate files to /etc/webmesh/tls:

cp pki/nodes/bootstrap/* /etc/webmesh/tls

cp pki/nodes/node/* /etc/webmesh/tls

Regenerating the PKI still doesn't help.

tinyzimmer commented 1 year ago

I'm starting to think a recent bump on grpc versions that I did might be causing this. But I still need to dig deeper. All the configurations are getting built correctly in the code it seems, but then grpc is not using them the way they are configured.

tinyzimmer commented 1 year ago

Ok I've managed to fix this locally, but I'm still not entirely sure what changed to make this stop working. The TLDR is that the server side of the connection needed to know to only verify the chain of the client certificate presented. Will tag a new v0.12.2 release soon.
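
In TLS terms, the server has to request a client certificate and validate it against the CA without tying it to a name or address. A minimal sketch using Python's standard `ssl` module follows. It is an analogy to the fix described, not the project's Go code, and the function and parameter names are illustrative.

```python
import ssl
from typing import Optional


def mtls_server_context(cert_file: Optional[str] = None,
                        key_file: Optional[str] = None,
                        client_ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Server context that requires a client certificate and checks its chain."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Require the client to present a certificate that chains to a trusted CA.
    # No name or IP matching is applied to the client certificate.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if client_ca_file:
        ctx.load_verify_locations(cafile=client_ca_file)
    if cert_file and key_file:
        # The server's own certificate and key, presented to joining nodes.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


server_ctx = mtls_server_context()
print(server_ctx.verify_mode == ssl.CERT_REQUIRED, server_ctx.check_hostname)
```

The file arguments are optional here only so the sketch runs without a PKI on disk; a real server would always load its certificate, key, and client CA.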

80imike commented 1 year ago

After updating to v0.12.2, the issue has been resolved. WireGuard successfully established a connection.

You may also need to apply the same certificate-handling changes to wmctl; otherwise wmctl reports the same error:

wmctl --config config.yaml --tls-skip-verify  status
Error: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: tls: failed to verify certificate: x509: cannot validate certificate for 10.0.8.17 because it doesn't contain any IP SANs"
tinyzimmer commented 1 year ago

Hrm, I did. The issue was server side, not client side actually. This may be a new bug.

tinyzimmer commented 1 year ago

I'm not able to reproduce this with a regular self-signed certificate at least. So the issue is not in the tls-skip-verify flag.

Running examples/simple/docker-compose.yaml:

 $ wmctl version
Version: 0.12.2
Git Commit: cfaf36734ba90aef351a80afe7e44dbac4618133
Build Date: 2023-10-10T12:00:41Z

$ wmctl --tls-skip-verify --server localhost:8443 status
{
  "id": "bootstrap-node",
  "version": "0.12.3-next",
  "commit": "398d576b0e9d67d070fa0def5952d43881c1fbc4",
  "buildDate": "2023-10-11T06:48:51Z",
  "uptime": "12.478891976s",
  "startedAt": "2023-10-11T06:49:15.726294405Z",
  "features": [
    {
      "feature": "NODES",
      "port": 8443
    },
    {
      "feature": "STORAGE_QUERIER",
      "port": 8443
    },
    {
      "feature": "MEMBERSHIP",
      "port": 8443
    },
    {
      "feature": "STORAGE_PROVIDER",
      "port": 9000
    },
    {
      "feature": "LEADER_PROXY",
      "port": 8443
    }
  ],
  "clusterStatus": "CLUSTER_LEADER",
  "currentLeader": "bootstrap-node",
  "interfaceMetrics": {
    "deviceName": "webmesh0",
    "publicKey": "JO7MWv/XeLg1Qi6umif2Q7A1dNZMIYzZHcBUkMyCOwc=",
    "addressV4": "172.16.0.1/32",
    "addressV6": "fdd3:e183:eb46:84b7:2f13:ec58:b111:0/112",
    "type": "Linux kernel",
    "listenPort": 51820,
    "totalReceiveBytes": "12548",
    "totalTransmitBytes": "12012",
    "numPeers": 1,
    "peers": [
      {
        "publicKey": "CAxzPwOs9GREOVUP4/3qHg6qX7awxS4iHBDsTyd3Jm8=",
        "endpoint": "10.1.0.3:51820",
        "persistentKeepAlive": "30s",
        "lastHandshakeTime": "2023-10-11T06:49:21Z",
        "allowedIPs": [
          "172.16.0.2/32",
          "fdd3:e183:eb46:d470:1bed:e5b3:9571:0/112"
        ],
        "protocolVersion": "1",
        "receiveBytes": "12548",
        "transmitBytes": "12012"
      }
    ]
  }
}

It could perhaps be in the config, in which case it may be a bug in the code that generates that config.

80imike commented 1 year ago

It seems that mTLS is not enabled in examples/simple/docker-compose.yaml? This problem should only appear after enabling mTLS.

tinyzimmer commented 1 year ago

Yea I'll have to do a test with that. Just seems odd because the underlying concept is the same and you are getting a client-side TLS error.

80imike commented 1 year ago

Okay, you can reproduce this problem after enabling mtls

tinyzimmer commented 1 year ago

Yep wasn't doubting you. And a test shows the error surfaces client-side with mTLS. It's just still odd :stuck_out_tongue:

tinyzimmer commented 1 year ago

It is down to an issue with the PKI. Basically it needs to encode the CA certificate on top of the server certificate when writing the files. This way both are presented to the client.

tinyzimmer commented 1 year ago

I seem to have the issue resolved on latest. Going to squeeze a few more things in before tagging.

tinyzimmer commented 1 year ago

v0.12.3 will have the fix :smile:

80imike commented 1 year ago

After updating to v0.12.3, the problem still exists. The error is reported as follows:

wmctl --config config.yaml  status
Error: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: failed to verify certificate: x509: certificate signed by unknown authority"

webmesh-node also started to report certificate errors:

webmesh-node \
  --global.disable-ipv6 \
  --global.mtls \
  --global.tls-ca-file=/etc/webmesh/tls/ca.crt \
  --global.tls-cert-file=/etc/webmesh/tls/tls.crt \
  --global.tls-key-file=/etc/webmesh/tls/tls.key \
  --global.verify-chain-only \
  --global.insecure-skip-verify \
  --mesh.join-address=10.0.8.17:8443
time=2023-10-12T09:34:07.950+08:00 level=INFO msg="Starting webmesh node" version=0.12.3 commit=22891c6630c908bc31ff779450bfc19d3548caf9 buildDate=2023-10-11T12:17:55Z
time=2023-10-12T09:34:07.961+08:00 level=WARN msg="InsecureSkipVerify is enabled, skipping TLS verification"
time=2023-10-12T09:34:07.961+08:00 level=WARN msg="VerifyChainOnly is enabled, only verifying the certificate chain"
time=2023-10-12T09:34:07.961+08:00 level=INFO msg="Joining webmesh cluster" component=mesh node-id=node
time=2023-10-12T09:34:07.963+08:00 level=ERROR msg="Join request failed" component=mesh node-id=node error="join: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: failed to verify certificate: x509: certificate signed by unknown authority\""
time=2023-10-12T09:34:08.963+08:00 level=INFO msg="Retrying join request" component=mesh node-id=node tries=1

tinyzimmer commented 1 year ago

You need to regenerate your PKI or manually add the CA certificates to the tls.crt chains if you haven't. The PKI generation was part of the bug.
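
For anyone taking the manual route, the operation amounts to appending the CA certificate to the PEM file that holds the node certificate. The helper below is a sketch under assumptions: the function names are made up, the inputs are the text contents of tls.crt and ca.crt, and the leaf certificate is kept first because TLS expects the sender's own certificate at the start of the chain. The strings in the example are placeholders, not real certificates.

```python
PEM_BEGIN = "-----BEGIN CERTIFICATE-----"


def count_certs(pem_text: str) -> int:
    """Number of PEM certificate blocks in the text."""
    return pem_text.count(PEM_BEGIN)


def bundle_chain(leaf_pem: str, ca_pem: str) -> str:
    """Return the leaf certificate followed by the CA certificate.

    If the CA block is already present, the input is returned unchanged
    (apart from whitespace normalization), so the helper can be re-run safely.
    """
    leaf = leaf_pem.strip()
    ca = ca_pem.strip()
    if ca in leaf:
        return leaf + "\n"
    return leaf + "\n" + ca + "\n"


# Placeholder PEM blocks standing in for tls.crt and ca.crt.
leaf = "-----BEGIN CERTIFICATE-----\nLEAFDATA\n-----END CERTIFICATE-----\n"
root = "-----BEGIN CERTIFICATE-----\nROOTDATA\n-----END CERTIFICATE-----\n"

bundled = bundle_chain(leaf, root)
print(count_certs(leaf), count_certs(bundled))
```

Regenerating the PKI with a fixed release remains the simpler option; this only illustrates what the resulting tls.crt is expected to contain.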

80imike commented 1 year ago

I regenerated the PKI and the new certificate included the SAN:

# bootstrap-node

        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage:
                TLS Web Client Authentication, TLS Web Server Authentication
            X509v3 Authority Key Identifier:
                keyid:52:91:DF:1E:CE:0F:51:65:7C:77:92:6A:20:37:7C:A3:40:B6:05:13

            X509v3 Subject Alternative Name:
                DNS:bootstrap
# node
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage:
                TLS Web Client Authentication, TLS Web Server Authentication
            X509v3 Authority Key Identifier:
                keyid:52:91:DF:1E:CE:0F:51:65:7C:77:92:6A:20:37:7C:A3:40:B6:05:13

            X509v3 Subject Alternative Name:
                DNS:node

The problem with wmctl has been solved, but a new problem has arisen:

 webmesh-node \
  --global.disable-ipv6 \
  --global.mtls \
  --global.tls-ca-file=/etc/webmesh/tls/ca.crt \
  --global.tls-cert-file=/etc/webmesh/tls/tls.crt \
  --global.tls-key-file=/etc/webmesh/tls/tls.key \
  --global.verify-chain-only \
  --global.insecure-skip-verify \
  --mesh.join-address=10.0.8.17:8443
time=2023-10-12T10:14:18.418+08:00 level=INFO msg="Starting webmesh node" version=0.13.1 commit=1777a4846c653cd848056c0c139f2f6f8e6cad7a buildDate=2023-10-11T23:28:08Z
time=2023-10-12T10:14:18.429+08:00 level=WARN msg="InsecureSkipVerify is enabled, skipping TLS verification"
time=2023-10-12T10:14:18.429+08:00 level=WARN msg="VerifyChainOnly is enabled, only verifying the certificate chain"
time=2023-10-12T10:14:18.429+08:00 level=INFO msg="Joining webmesh cluster" component=mesh node-id=node
time=2023-10-12T10:14:18.433+08:00 level=ERROR msg="Join request failed" component=mesh node-id=node error="join: rpc error: code = PermissionDenied desc = node id node does not match authenticated caller"
time=2023-10-12T10:14:19.434+08:00 level=INFO msg="Retrying join request" component=mesh node-id=node tries=1
time=2023-10-12T10:14:19.437+08:00 level=ERROR msg="Join request failed" component=mesh node-id=node error="join: rpc error: code = PermissionDenied desc = node id node does not match authenticated caller"
tinyzimmer commented 1 year ago

This was actually a different problem not related to mtls that should be solved on the most recent tags. It surfaced when I was adding in the ID auth stuff.

80imike commented 1 year ago

The issue has been resolved in v0.13.3. The cause: when generating a certificate with wmctl pki issue, the name must match the name of the corresponding node. For example:

wmctl pki --pki-directory ./pki issue --name mike-node-12

The value of the --name parameter determines the Subject Alternative Name of the certificate. This is very important.
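
A quick way to check this before joining is to compare the intended node ID with the DNS SANs shown by `openssl x509 -noout -text`. The helper below is a hypothetical pre-flight check, not part of wmctl: it parses the text layout seen in the openssl output earlier in this thread, and how the server actually compares the node ID with the certificate is not shown here.

```python
from typing import List

SAN_HEADER = "X509v3 Subject Alternative Name"


def dns_sans_from_openssl_text(text: str) -> List[str]:
    """Extract DNS SAN entries from `openssl x509 -noout -text` output."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        # The SAN values sit on the line directly below the header.
        if SAN_HEADER in line and i + 1 < len(lines):
            entries = [e.strip() for e in lines[i + 1].split(",")]
            return [e[len("DNS:"):] for e in entries if e.startswith("DNS:")]
    return []


def name_matches_node(node_id: str, openssl_text: str) -> bool:
    """True if the node ID appears among the certificate's DNS SANs."""
    return node_id in dns_sans_from_openssl_text(openssl_text)


# Excerpt in the same layout as the certificate dump posted above.
SAMPLE = """\
        X509v3 extensions:
            X509v3 Subject Alternative Name:
                DNS:node
"""

print(dns_sans_from_openssl_text(SAMPLE))
print(name_matches_node("mike-node-12", SAMPLE))
```

A certificate issued with `--name node` would not pass this check for a machine whose node ID is `mike-node-12`, which is the mismatch described above.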

80imike commented 1 year ago

wmctl --config admin.yaml status mike-node-12 --tls-skip-verify now works, but I found another wmctl error:

$ wmctl --config admin.yaml  get nodes  mike-node-12 --tls-skip-verify
Error: rpc error: code = Unimplemented desc = unknown service v1.Mesh

It seems that, apart from wmctl status, which works normally, the other subcommands report errors.

tinyzimmer commented 1 year ago

This is because you don't have the mesh api enabled. We are drifting into documentation territory and not so much mTLS stuff. But the TLDR is you need to pass services.api.mesh-enabled to the node you are wanting to do that against.

Will close this issue for now and we can open new ones for any other bugs.