Wednesday, December 6, 2023

The naming problem isn’t complicated if you know your use case

Chris Hughes recently put out a very well-written post (not that this is any exception for him, of course) on the naming problem, which focuses on the document that CISA and DHS published a couple of months ago[i]. The latter document was also well written, but both it and Chris’ post suffer from the fact that they don’t start with a specific use case – and therefore they both end by throwing up their hands and saying the problem is “complicated”. I agree that the naming problem is complicated if you try to address multiple use cases, or worse, if you don’t specify any use case at all (in which case you implicitly assume the burden of addressing all use cases). But solving the naming problem, at least in principle, is easy if you confine yourself to a specific use case.

To identify your use case, ask why software naming is important to you. After all, nobody loses sleep over the fact that software names are currently confusing and inconsistent, unless this hinders them in accomplishing something they need to accomplish.

Here are some (of many) use cases for which a solution to the naming problem would be important:

1.      An end user recently used an open source software product that they find helpful, but they don’t know where to find it. In fact, they find it is available in multiple repositories, although under slightly different names and version numbers. How can they find the exact one that they used?

2.      A company has heard of a proprietary software product that they want to buy for their own use, but they don’t know the name of the supplier of that product.

3.      A software user has received a software bill of materials (SBOM) from the supplier of a product they use heavily. They would like to learn about vulnerabilities that apply to the components found in the product, but the identifiers for the components in the SBOM don’t seem to be in any vulnerability database they can find.

4.      A software supplier needs to report a vulnerability in their product to CVE.org, which assigns CVE numbers, but doesn’t know how to identify it.

These are very different use cases, and there are many others. There’s no assurance that any discussion of a solution to the “naming problem” will solve all of them or even more than one of them. Thus, any discussion of the naming problem needs to start with an identification of the use case being addressed. Items 3 and 4 are both part of the use case discussed in this post.

The use case that drives this discussion is identifying vulnerabilities applicable to a software product, and I believe it’s behind both CISA’s and Chris Hughes’ arguments, although it’s not explicitly stated. Of course, software vulnerability management is very important for cybersecurity in general. The topic has taken on increasing importance in the last couple of years, in good part because of the focus on SBOMs. This is because:

1.      Without an SBOM, the user of a software product, who is concerned about the product’s security, would only need to have one identifier: that for the product (as well as the version of the product) itself.

2.      If the user gets an SBOM and wants to use that for vulnerability management, they suddenly need to have an identifier for each of the components in that product, not just for the product itself. Since the average software product has around 150 components, you could say the identifier problem for them is now 150 times larger.

3.      If the SBOM doesn’t list component identifiers that can be found in a vulnerability database, it isn’t likely the user will ever be able to learn about vulnerabilities due to those components. In other words, SBOMs will be useless for vulnerability management. And this isn’t speculation: I’ve been told by multiple developers of software and intelligent devices that only in a small percentage of cases can the identifier for a component in an SBOM produced by an automated process be found in a vulnerability database. Specifically, I’ve heard from a couple of major software suppliers that, in a typical automated SBOM, fewer than 5% of component identifiers meet that standard.

4.      This is why software suppliers who wish to provide SBOMs to their customers that are usable for vulnerability management (and if you look at articles and posts about SBOMs, a large percentage of them – certainly the majority – focus on vulnerability management as the only use case) have to spend a lot of time finding useful component identifiers. They need to utilize a whole grab bag of tools to do this: AI/ML, fuzzy logic, guesswork, collections of documents like GitHub commits, prayer, etc. Of course, none of this work can be fully automated. It’s as if you rode a high-speed train from Chicago to New York City (a guy can dream, right?), but had to travel the last mile to the station by oxcart.
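The numbers in the list above can be put into a small sketch. This is only an illustration: the component identifiers and the one-entry "database" are made up, and `match_rate` and `max_matched` are hypothetical helper names, not part of any SBOM tool.

```python
# Hypothetical illustration: every identifier below is invented.

def match_rate(sbom_identifiers, database_identifiers):
    """Return the fraction of SBOM component identifiers found in the database."""
    if not sbom_identifiers:
        return 0.0
    known = set(database_identifiers)
    found = sum(1 for identifier in sbom_identifiers if identifier in known)
    return found / len(sbom_identifiers)


def max_matched(component_count, percent):
    """Upper bound on matched components for a whole-number percentage."""
    return component_count * percent // 100


# Four spellings of the same made-up component; only one is in the database.
sbom = [
    "pkg:pypi/example-lib@1.0.0",
    "example-lib 1.0",
    "Example Lib v1",
    "examplelib-1.0.0",
]
database = {"pkg:pypi/example-lib@1.0.0"}

print(match_rate(sbom, database))  # 0.25
print(max_matched(150, 5))         # 7
```

With 150 components and a match rate under 5%, at most 7 components can be looked up automatically. The rest are left to the grab bag of manual techniques.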

Now that we know our use case is vulnerability management, what do we need to find out first, in order to solve our naming problem? That’s not a hard question; we need to find out what identifiers are used in vulnerability databases today. It turns out there are just two of them: CPE, which is found in the National Vulnerability Database (NVD) and in databases derived from the NVD, and purl, which is found in almost every other vulnerability database. In fact, some very knowledgeable people – including Philippe Ombredanne of nexB, the developer of the purl concept – have told me they don’t know of a single public vulnerability database that is based on anything other than CPE or purl, and the great majority of vulnerability databases are based on purl. Of course, there are lots of other software identifiers that are useful for payments to suppliers, licensing, etc. – but these are all different use cases, and we don’t need to consider them any further now.
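For readers who haven’t seen the two identifiers side by side, here is a minimal sketch that splits one example of each into its fields. It assumes the general purl layout (`pkg:type/namespace/name@version?qualifiers#subpath`) and the 13-part CPE 2.3 formatted string. The parsers are deliberately simplified (no percent-decoding, no handling of escaped colons in CPE), and the sample strings are for illustration only, so treat this as a teaching aid rather than a conforming implementation.

```python
CPE_FIELDS = [
    "part", "vendor", "product", "version", "update", "edition",
    "language", "sw_edition", "target_sw", "target_hw", "other",
]


def parse_purl(purl):
    """Split a simple purl into type, namespace, name and version."""
    if not purl.startswith("pkg:"):
        raise ValueError("not a purl: " + purl)
    remainder = purl[len("pkg:"):]
    remainder = remainder.partition("#")[0]  # drop the optional subpath
    remainder = remainder.partition("?")[0]  # drop the optional qualifiers
    version = None
    if "@" in remainder:
        remainder, version = remainder.rsplit("@", 1)
    segments = remainder.split("/")
    if len(segments) < 2 or not segments[-1]:
        raise ValueError("purl needs a type and a name: " + purl)
    return {
        "type": segments[0],
        "namespace": "/".join(segments[1:-1]) or None,
        "name": segments[-1],
        "version": version,
    }


def parse_cpe23(cpe):
    """Split a CPE 2.3 formatted string into its eleven named fields."""
    pieces = cpe.split(":")  # simplification: ignores escaped colons
    if pieces[:2] != ["cpe", "2.3"] or len(pieces) != 13:
        raise ValueError("not a CPE 2.3 formatted string: " + cpe)
    return dict(zip(CPE_FIELDS, pieces[2:]))


print(parse_purl("pkg:pypi/django@4.2.7"))
print(parse_cpe23("cpe:2.3:a:djangoproject:django:4.2.7:*:*:*:*:*:*:*"))
```

Notice that the purl is built from things the user can see (the package manager, the package name and the version), while the CPE requires knowing the exact vendor and product strings that were assigned centrally.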

At this point, let’s make the rest of our job easier. Let’s agree that, in considering possible identifiers for use in vulnerability databases, we confine ourselves to the ones that are already in use. The only reason why we wouldn’t do this is if we examine both CPE and purl, and decide that both of them suffer from serious problems, meaning we should look elsewhere for identifiers.

Let’s look at CPE first. Both Chris’ post and the CISA/DHS document do a good job of describing the concept of CPE. I agree that CPE sounds great (well, good at least) in concept, but what about in practice? Since CPE has been in existence for well over a decade, we don’t have to guess about how it will perform in practice; we can look at its record. That record is laid out in some detail in the OWASP SBOM Forum’s “Proposal to Operationalize Component Identification for Vulnerability Management”, on pages 4-6. (I led development of that document, although Steve Springett, Tony Turner, Kate Stewart and David Wheeler were responsible for the ideas. Chris provided a very intelligent description of the document in his post, which I greatly appreciated. It was definitely the best description I’ve seen so far by anybody who wasn’t involved in writing the document.)

Please read those pages for yourself, but there’s one sentence that summarizes how suitable CPE is for the use case we’re most concerned with: looking up components found in an SBOM in a vulnerability database. The sentence is (p. 4): “Oracle Corporation estimates they can identify CPEs for no more than 20% of the components in their software products.” If you think about it, that’s quite an indictment. If Oracle can’t identify CPEs for more than 20% of the components in their own products, what chance does a poor end user like you have of identifying CPEs for the components listed in an SBOM you receive from a supplier whose products you use? The chance that a snowball has in the Infernal Regions? Not even that?

I hope you get the idea: CPE is a big part of the problem. It’s definitely not part of the solution. Meanwhile, it’s remarkable that almost every vulnerability database in the world that isn’t the NVD, or one of the handful of direct derivatives of the NVD, uses purl. Knowing that, I have to say “Case closed.” Purl has won the battle for supremacy among software identifiers, although CPE won’t go away anytime soon, because the huge existing base of CPE data is still quite valuable despite having a lot of errors. However, what I would like to see is purl being used as much as possible going forward.

For open source software vulnerabilities, purl is the unquestioned king. However, for proprietary software and for intelligent devices, there’s still work to be done. Regarding proprietary software, in the paper we (i.e., the OWASP SBOM Forum) proposed that SWID tags be made the basis for a new purl type. Steve Springett – who is a purl maintainer and worked with Philippe Ombredanne in some of the original development of the concept – has already taken care of that, but what remains is to figure out the best way (or ways) to make SWID tags (or more specifically, the information in SWID tags) available to users of proprietary software, especially legacy proprietary software. There are in fact many ways that could be done, but deciding which is/are best will be a challenge. To be honest, we haven’t started to work on that yet.

Regarding devices, we proposed in the paper that the existing GTIN and GMN naming conventions (which are proprietary and licensed by GS1) be used, since they are already being widely used for trade purposes. However, I’m wondering how vulnerability reporting would work in that case, since the names may be proprietary. I would like to explore the idea of developing new purl types to handle devices.

It might seem strange that an identifier that works well with open source software (OSS) would also work well with proprietary software, and especially with hardware devices. After all, purl for OSS is based on where the software was downloaded from, and I don’t think anybody has figured out how to download a hardware device yet (perhaps using quantum teleportation?).

However, what is required for purl to work is for the user to be able to construct the identifier based on information they already know. For OSS, the user knows where they got the software. For proprietary software, our proposal suggests that the user would know the contents of a SWID tag for the product they’re using (although we haven’t done the work to figure out the best way to make the contents available to the user – they may need to get it from a pre-specified location on the supplier’s web site).
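To show what “construct the identifier based on information they already know” could look like, here is a minimal sketch. `build_purl` is a hypothetical helper, not the official purl library, and it skips the type-specific normalization rules that the purl specification defines. The SWID example is an assumption about how a tag’s contents might map onto a purl (tag creator and regid as the namespace, product name as the name, tag ID as a qualifier). All the SWID values shown are invented, so check the purl specification for the real rules.

```python
from urllib.parse import quote


def build_purl(pkg_type, name, version=None, namespace=None, qualifiers=None):
    """Assemble a purl string from pieces the user already knows."""
    parts = ["pkg:", pkg_type.lower(), "/"]
    if namespace:
        # Percent-encode each namespace segment separately, keeping the slashes.
        encoded = "/".join(quote(seg, safe="") for seg in namespace.split("/"))
        parts.append(encoded + "/")
    parts.append(quote(name, safe=""))
    if version:
        parts.append("@" + quote(version, safe=""))
    if qualifiers:
        # Qualifiers are emitted in sorted key order so the result is stable.
        pairs = sorted(qualifiers.items())
        parts.append("?" + "&".join(f"{k}={quote(v, safe='')}" for k, v in pairs))
    return "".join(parts)


# Open source: the user knows the package manager, the name and the version.
print(build_purl("pypi", "django", "4.2.7"))
# pkg:pypi/django@4.2.7

# Proprietary: values read from a (made-up) SWID tag for the product.
print(build_purl(
    "swid",
    "Example Product",
    "1.0.0",
    namespace="Example Corp/example.com",
    qualifiers={"tag_id": "hypothetical-tag-id"},
))
# pkg:swid/Example%20Corp/example.com/Example%20Product@1.0.0?tag_id=hypothetical-tag-id
```

In both cases no central lookup is needed: if the supplier and the user start from the same information, they should arrive at the same identifier.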

For devices, I’m thinking there might be an information source similar to a SWID tag – but I admit I haven’t talked to anybody else about this yet.

The moral of this story is that there’s no longer any question what the best identifier for software (and maybe hardware) is when the use case is vulnerability management – that’s purl. Fortunately or unfortunately, there is still a lot of work left to be done for the vulnerability identification use case, including:

1.      How to identify proprietary software and devices using purl, as described above.

2.      How to report and track vulnerabilities for hardware devices, since the vulnerabilities aren’t in – say – the sheet metal or plastic that the device is made out of, but rather the software and/or firmware installed in the device. I expect to write a post on that question soon.

3.      How to solve peripheral parts of the naming problem, including aliasing (which applies primarily to proprietary products). Steve Springett has a nifty idea for solving that problem known as Common Lifecycle Enumeration. If you’re interested in working on that problem, I know he would love to hear from you. If you email me, I’ll put you in touch with Steve.

 Any opinions expressed in this blog post are strictly mine and are not necessarily shared by any of the clients of Tom Alrich LLC. If you would like to comment on what you have read here, I would love to hear from you. Please email me at tom@tomalrich.com.

I lead the OWASP SBOM Forum. If you would like to learn more about what that group does or contribute to our group, please go here.


[i] The post also focuses on the presentation on naming that Lindsey Cerkovnik of CISA gave at S4.
