Fix OpenShift project stuck with “Terminating” status

This is the story of a strange issue I ran into while working with OpenShift projects (Kubernetes namespaces) on a cluster where IBM Cloud Pak for Integration was installed. I still don’t know what caused it, but I will keep an eye out for patterns and update this post if I find any.

It all started when I created a sample project for some unimportant task that I can’t even recall. Eventually I decided to delete it with the command:

$ oc delete project forgotten-project

I moved on, and it wasn’t until a few days later, while doing something similar, that I listed all the projects in my cluster and noticed that forgotten-project was still there with a status of “Terminating”. Interestingly enough, by that point there were 4 different projects in the same state, because I had never bothered to check whether they were really gone.

I hopped onto Google to search for how to get rid of them, and it was pretty easy to spot 2 approaches:

A resource may be stuck and needs to be manually deleted

Fair enough, seems like an easy task to do, I’d just list all the resources and delete them so the project can go away.

$ oc get all -n forgotten-project
No resources found.

Wait… what? Nothing is stuck and the project is not gone. Ok, let’s look at the second approach.
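One likely reason for the empty result: "oc get all" only covers a fixed set of common resource types, so custom resources and some others never show up in it. A more thorough check is to ask the API server for every namespaced type it knows and query each one in turn. The sketch below wraps that in a helper whose name, list_leftovers, is made up for illustration; it assumes kubectl is already pointed at the cluster.

```shell
# Hypothetical helper: list whatever is still left in a namespace.
# Enumerates every listable, namespaced resource type and queries each one.
list_leftovers() {
  local ns="$1"
  kubectl api-resources --verbs=list --namespaced -o name |
    while read -r resource; do
      # --ignore-not-found keeps empty resource types quiet;
      # stdin is redirected so the loop's input is not consumed.
      kubectl get "$resource" -n "$ns" --ignore-not-found -o name </dev/null
    done
}
```

Usage would be: list_leftovers forgotten-project. If API discovery itself fails, kubectl prints that error to stderr, which is a clue in its own right.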

Force delete the project

Surely if the former didn’t do the trick this would do it. Here we go:

$ oc delete project forgotten-project --grace-period=0 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
Error from server (Conflict): Operation cannot be fulfilled on namespaces "forgotten-project": The system is ensuring all content is removed from this namespace.  Upon completion, this namespace will automatically be purged by the system.

Oh no. This isn’t good news. What is going on? This option is also no bueno, so it’s time to go back to Google.
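The Conflict message says the system is still cleaning up, so the namespace object itself is a reasonable next place to look. A minimal sketch, assuming kubectl access: print the finalizers recorded on the namespace, plus any status conditions the cluster reports (older Kubernetes versions may not populate conditions at all). The function name show_namespace_state is hypothetical.

```shell
# Hypothetical helper: show what the API server records about a namespace
# that is stuck in "Terminating".
show_namespace_state() {
  local ns="$1"
  echo "finalizers:"
  kubectl get namespace "$ns" -o jsonpath='{.spec.finalizers}{"\n"}'
  echo "conditions:"
  # One line per condition: type, status, message. Prints nothing on
  # clusters that do not report namespace conditions.
  kubectl get namespace "$ns" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'
}
```

Usage would be: show_namespace_state forgotten-project. Where conditions are reported, their messages tend to say which part of the cleanup is failing.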

Finding the culprit

After going past the 3rd page of Google results and rephrasing my search more times than I can remember, something suggested checking all the API resources. By this point I was already looking at solutions that use kubectl rather than the oc CLI.

I tried listing the api-resources and got an error:

$ kubectl api-resources -o wide
...
# Tons of regular results and then...
...
error: unable to retrieve the complete list of server APIs: admission.certmanager.k8s.io/v1beta1: Unauthorized

Ok. I wasn’t expecting an error here but seems like it could be related. Back to Google.
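The group in that error is served by an aggregated API, which Kubernetes registers as an APIService object. Namespace cleanup relies on API discovery to find leftover content, so a group that fails discovery is a plausible suspect. A quick way to look for such groups is to list the APIService objects and keep only those whose AVAILABLE column is not True. This is only a sketch: unavailable_apiservices is a made-up name, it relies on the default column layout of "kubectl get apiservices" (NAME, SERVICE, AVAILABLE, AGE), and whether an Unauthorized discovery error shows up there was not verified on this cluster.

```shell
# Hypothetical helper: print the name and availability of every APIService
# that does not report AVAILABLE=True.
unavailable_apiservices() {
  kubectl get apiservices --no-headers |
    awk '$3 != "True" { print $1, $3 }'
}
```

Any name it prints can then be inspected with "kubectl get apiservice NAME -o yaml" to find the backing service and the failure reason.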

Alright, found something related in this article

Following those instructions I found out that one of the pods was logging errors:

$ kubectl logs cert-manager-webhook-fdc4d7869-p28bm -n cert-manager
....
E0330 23:30:50.778426       1 authentication.go:65] Unable to authenticate the request due to an error: x509: certificate signed by unknown authority
E0330 23:30:50.778486       1 authentication.go:65] Unable to authenticate the request due to an error: x509: certificate signed by unknown authority

This is definitely not right; maybe a certificate rotation didn’t work out correctly for this pod. My first attempt at fixing it was to delete the pod, since it would be re-created automatically:

$ kubectl delete pod cert-manager-webhook-fdc4d7869-p28bm -n cert-manager
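An alternative that avoids copying the pod’s generated name is to restart the Deployment that owns it. The sketch below assumes the Deployment is called cert-manager-webhook (inferred from the pod name prefix, not confirmed) and that kubectl is version 1.15 or newer, where "rollout restart" is available. The function name restart_webhook is hypothetical.

```shell
# Hypothetical helper: restart the webhook Deployment and wait for the
# replacement pod to become ready.
# Both defaults are assumptions taken from the pod name in this post.
restart_webhook() {
  local ns="${1:-cert-manager}"
  local deploy="${2:-cert-manager-webhook}"
  kubectl rollout restart "deployment/$deploy" -n "$ns" &&
    kubectl rollout status "deployment/$deploy" -n "$ns" --timeout=120s
}
```

Calling restart_webhook with no arguments uses the assumed defaults; pass a namespace and a Deployment name if yours differ.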

Wait a few seconds and check the logs of the new pod:

$ kubectl logs cert-manager-webhook-fdc4d7869-j9lxx -n cert-manager
flag provided but not defined: -v
Usage of tls:
  -tls-cert-file string

I0330 23:32:25.157520       1 secure_serving.go:116] Serving securely on [::]:1443

Yes! That’s progress, let’s go back to the previous issue of the error while listing the api-resources in the cluster.

$ kubectl api-resources -o wide

It’s fixed, no errors are shown. Great! Now let’s go back to check on the project that was stuck in “Terminating” status using oc get projects and… it’s gone (along with the other stuck projects)!
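To confirm that nothing else is still stuck, the namespace list can be filtered by phase instead of scanning it by eye. A small sketch, assuming kubectl access; terminating_namespaces is a hypothetical helper name.

```shell
# Hypothetical helper: list namespaces whose phase is still "Terminating".
# Prints nothing when every deletion has completed.
terminating_namespaces() {
  kubectl get namespaces --field-selector status.phase=Terminating -o name
}
```

An empty result means every pending deletion has gone through.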

Problem solved, but the mystery isn’t. I think I understand why the error was happening, but I’m not sure how to put it into words that make sense, at least for now.

In any case, if you happen to hit this issue and find this blog, I hope your story is similar to mine and, more importantly, that it ends the same way: with a solution.

Until the next time.

Published 31 Mar 2020

Tips, tricks and lessons learned by a cloud integration architect. Cesar Cavazos on Twitter