
tailscale-operator not working on AKS #8111

Closed
tiesmaster opened this issue May 10, 2023 · 15 comments · Fixed by #8762 · May be fixed by #8749
Labels
bug · kubernetes · L2 Few (Likelihood) · P2 Aggravating (Priority level) · T5 Usability (Issue type)

Comments

@tiesmaster

What is the issue?

I want to expose my AKS Kubernetes cluster via Tailscale, and tried to do that via the Tailscale proxy; however, that doesn't seem to proxy the traffic to the destination IP. As suggested here, I tried out the Tailscale Operator, hoping that it might work, but I'm running into the same issue there.

To troubleshoot the issue, I've spun up a secondary cluster using microk8s, and I've got the Tailscale Operator working there, as you can see in the logs of the proxy pods:

Proxy pod on AKS

$ kubectl logs -n tailscale pod/ts-sample-workload-one-bjvhs-0
...
2023/05/10 08:27:55 Accept: TCP{100.115.48.77:64882 > 100.79.209.22:80} 52 tcp ok
2023/05/10 08:27:56 Accept: TCP{100.115.48.77:64882 > 100.79.209.22:80} 52 tcp ok

Proxy pod on microk8s

$ kubectl logs -n tailscale pod/ts-sample-workload-one-7nst7-0
...
2023/05/10 08:27:40 Accept: TCP{100.115.48.77:64880 > 100.67.186.56:80} 52 tcp ok
2023/05/10 08:27:40 Accept: TCP{100.67.186.56:80 > 100.115.48.77:64880} 52 ok out
2023/05/10 08:27:40 Accept: TCP{100.115.48.77:64880 > 100.67.186.56:80} 40 tcp non-syn

This is what I get when I curl the Tailscale IP of the sample workload (this is the Deployment and Service that I've used for it). When I hit the AKS cluster, nothing happens; when I hit the microk8s cluster, I get the expected output.
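
Concretely, these are the curl invocations (the Tailscale IPs are the destination addresses from the Accept lines above; they're specific to my tailnet):

$ curl http://100.79.209.22/    # AKS: hangs, no response
$ curl http://100.67.186.56/    # microk8s: returns the expected output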

I've looked at the source of the Tailscale proxy image (which turns out to also be used under the hood by the Tailscale Operator), and I see that it basically just sets up some iptables rules, so the same rules should be visible in the proxy containers on both clusters. I don't have any experience with iptables and don't really know how to troubleshoot this, but I've found a couple of commands to inspect things. I can see that the AKS pod has some rules installed, but it's also missing quite a few.

Is this a bug in the Tailscale proxy image? Or should I report this to AKS?

iptables output on AKS

/ # iptables -nvL
Chain INPUT (policy ACCEPT 682 packets, 287K bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 605 packets, 173K bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain ts-forward (0 references)
 pkts bytes target     prot opt in     out     source               destination

Chain ts-input (0 references)
 pkts bytes target     prot opt in     out     source               destination
/ # iptables -vL
Chain INPUT (policy ACCEPT 682 packets, 287K bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 605 packets, 173K bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain ts-forward (0 references)
 pkts bytes target     prot opt in     out     source               destination

Chain ts-input (0 references)
 pkts bytes target     prot opt in     out     source               destination
/ # iptables -w -t nat -L "PREROUTING"
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
DNAT       all  --  anywhere             100.79.209.22        to:10.0.45.41

iptables output on microk8s

/ # iptables -nvL
Chain INPUT (policy ACCEPT 100 packets, 25080 bytes)
 pkts bytes target     prot opt in     out     source               destination
  100 25080 ts-input   all  --  *      *       0.0.0.0/0            0.0.0.0/0

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
    0     0 ts-forward  all  --  *      *       0.0.0.0/0            0.0.0.0/0

Chain OUTPUT (policy ACCEPT 93 packets, 18922 bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain ts-forward (1 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 MARK       all  --  tailscale0 *       0.0.0.0/0            0.0.0.0/0            MARK xset 0x40000/0xff0000
    0     0 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x40000/0xff0000
    0     0 DROP       all  --  *      tailscale0  100.64.0.0/10        0.0.0.0/0
    0     0 ACCEPT     all  --  *      tailscale0  0.0.0.0/0            0.0.0.0/0

Chain ts-input (1 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 ACCEPT     all  --  lo     *       100.67.186.56        0.0.0.0/0
    0     0 RETURN     all  --  !tailscale0 *       100.115.92.0/23      0.0.0.0/0
    0     0 DROP       all  --  !tailscale0 *       100.64.0.0/10        0.0.0.0/0
/ # iptables -w -t nat -L "PREROUTING"
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
DNAT       all  --  anywhere             default-sample-workload-one-1.shark-egret.ts.net  to:10.152.183.137

Steps to reproduce

  1. Deploy an AKS cluster
  2. Deploy the Tailscale Operator
  3. Deploy a sample workload that has the loadBalancerClass set to tailscale, like this one:
    • kubectl apply -f https://gist.githubusercontent.com/tiesmaster/d7b397f19015514451fd0cd58b37fb06/raw/ed83bca958ee43ab107f2468a849918d4d0da87f/sample-workload-one.yaml
  4. Hit the sample workload endpoint with curl (see the example below)
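
For step 4, a minimal check looks like this (the MagicDNS name is the one from my tailnet, visible in the PREROUTING output above; yours will differ):

$ curl http://default-sample-workload-one-1.shark-egret.ts.net/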

Are there any recent changes that introduced the issue?

No response

OS

Other

OS version

AKS with Kubernetes version 1.26.3

Tailscale version

No response

Other software

No response

Bug report

No response

@DentonGentry added the kubernetes, L2 Few (Likelihood), P2 Aggravating (Priority level), and T5 Usability (Issue type) labels and removed needs-triage May 27, 2023
@DentonGentry changed the title Tailscale proxy or tailscale-operator not working on AKS? → tailscale-operator not working on AKS Jun 4, 2023
@rodrigc
Contributor

rodrigc commented Jun 16, 2023

I suspect that these issues might be related:

#8244
#5621
#391

@tiesmaster
Author

@rodrigc Awesome! Thanks for the links. I'm gonna do some reading up on those and report back here with my findings.

@wadhah101

@tiesmaster Exact same issue here with AKS and the Tailscale proxy.

@tiesmaster
Author

@wadhah101 Thanks for the +1! I suspected it would break for others on AKS as well (it should, right?), so it's good to know, and good to know that others would be helped by getting this fixed (and perhaps pushed upstream). I looked into the other issues briefly, but didn't manage to do a write-up here.

What I found out is that AKS doesn't appear to be using iptables-nft, contrary to what you would expect from the other issues, so I can't explain this so far. I'm going to look into this over the course of the next week. I'll keep you posted when I have some results.

@wadhah101

wadhah101 commented Jun 30, 2023

@tiesmaster Thanks for your response; unfortunately, I lack the skills to debug this myself. Thanks for keeping everyone posted 🙏

@tiesmaster
Author

@wadhah101 Quick update here: I didn't manage to look into this over the last couple of weeks, and I'll be on holiday for over two weeks starting now. On top of that, I don't know if I'll be able to work on this in the short term even after that, so I can't give any guarantees. I'll keep you posted when I do.

@rodrigc
Contributor

rodrigc commented Jul 20, 2023

@tiesmaster @wadhah101 Major update posted by @KevinLiang10 here:
#391 (comment)

It may or may not help with your issue, but it's worth testing.

@clarkezone

I'm having a similar issue on a k3s cluster (#8733); wondering if this could be related.

clarkezone added a commit to clarkezone/tailscale that referenced this issue Jul 30, 2023
Add flag to k8s-operator to enable TS_DEBUG_USE_NETLINK_NFTABLES in tailscaled that was introduced in tailscale#8555

Fixes tailscale#8111, tailscale#8733

Signed-off-by: James Clarke <james@clarkezone.net>
@clarkezone

Based on my testing, this issue appears to be resolved by the experimental nftables support that was recently merged. I developed a PR to enable this in the k8s-operator and was able to get the operator working correctly in an AKS cluster, as well as in my k3s case above. If you want to try my fix, you can patch the operator.yaml file as follows to pick up my private builds and enable nftables mode. The pertinent changes are the operator image, PROXY_IMAGE, and the new OPERATOR_USENFT env var:

      containers:
        - name: operator
          image: clarkezone/tsoperatornftfix:latest
          resources:
          ...
            - name: CLIENT_SECRET_FILE
              value: /oauth/client_secret
            - name: PROXY_IMAGE
              value: "clarkezone/tsclientnftfix:latest"
            - name: PROXY_TAGS
          ...
            - name: OPERATOR_USENFT
              value: "true"

@rodrigc
Contributor

rodrigc commented Jul 30, 2023

@clarkezone That's great that you confirmed the fix. Even better that you submitted a patch!

In your patch to the operator, is it possible to somehow detect that nftables exists on the system, and then spit out a log informing the user that they should set this particular variable to get things working?

Bravo for your work and @KevinLiang10's original patch!
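
For reference, this is roughly what such a detection could check from a shell inside the container (a sketch; it assumes the iptables and nft binaries are present in the image):

$ iptables --version   # iptables 1.8+ prints its backend, e.g. "iptables v1.8.7 (nf_tables)" or "(legacy)"
$ nft list tables      # only succeeds if the kernel has nf_tables support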

@KevinLiang10
Contributor

@rodrigc Hi, I'm working on detecting whether nftables or iptables is available/in use on the machine, and just using that. The implementation isn't targeted at solving k8s problems specifically, though; rather, it's meant to relieve users from having to explicitly set an env var to use the new nftables feature.

spit out a log informing the user that they should set this particular variable to get things working

Thanks for the suggestion; we'll discuss adding it to our runtime logging, since we're detecting nftables/iptables support at runtime anyway.

I'll link these k8s issues when I put the PR up, and I'll test whether the solution helps with this issue!

@rodrigc
Contributor

rodrigc commented Aug 2, 2023

@KevinLiang10 Nice work. Long term, the heuristics you are implementing in #8762 to auto-select between iptables and nftables are the right approach; having users depend on too many weird knobs is not ideal.

In #8762, it would be good to add solid logging so that, as you develop the feature, end users can get some idea of how it is working in their environment. That will also help identify any weird corner cases where the heuristic logic breaks down.

@KevinLiang10
Contributor

KevinLiang10 commented Aug 2, 2023

@rodrigc I've added logs for which firewall tool is being used and the tool's condition. Apart from that, what logs would be particularly interesting to you?

I'd really love to hear what you're interested in; your opinions will definitely help us improve the user experience!

@rodrigc
Contributor

rodrigc commented Aug 2, 2023

@KevinLiang10 These are some ideas I have of things to log:

  1. What firewall tool is being used?
  2. What is the version of the firewall tool?
  3. What is the version of the kernel?
  4. What firewall capability is available in the kernel?
  5. What is the version of the firewall capability compiled into the kernel?
  6. If it's available, the output of lsb_release -a is sometimes helpful for quickly figuring out version/distro information

Basically, anything that can help the user debug when things go wrong with iptables/nftables; a quick survey like the sketch below would cover most of it.
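
For instance (a sketch; it assumes these tools are actually installed in the environment):

$ iptables --version   # firewall tool and backend, e.g. "iptables v1.8.7 (nf_tables)" or "(legacy)"
$ nft --version        # nftables userspace version, if present
$ uname -r             # kernel version
$ lsb_release -a       # distro/version information, when available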

@rodrigc
Contributor

rodrigc commented Aug 8, 2023

@KevinLiang10 The other perspective I would suggest: from your vantage point at Tailscale, what kinds of information/logs would you want the user to provide in a bug report, to better troubleshoot this particular feature?

I gave you a list of what I think are good pieces of information to log, but as the author of this feature, you might have more ideas.
