Saturday, June 1, 2024

We’re making progress on vulnerability database issues

Vulnerability database expert Brian Martin and I have been having a good back-and-forth discussion on LinkedIn about vulnerability database issues in general, including discussion of my proposal for a Global Vulnerability Database.

Today, Brian put up a new post that moves the discussion forward. His post includes 6 or 7 passages that point to what I think are common misconceptions that haven’t been well articulated previously. Because Brian has articulated them so clearly, I want to comment on each one of them. I’ll quote each of Brian's passages in red and then comment in black italics: 

While a Persistent Uniform Resource Locators (PURL) is one solution, it isn’t the only one used by vulnerability databases. So not only do you need to have an intelligent mapping from PURL to PURL, you also need it from CPE to PURL, and possibly other identifiers. It’s easy to have multiple valid PURLs all for the same piece of software.

BTW, purl in this context stands for “package URL”. Here is a good description of purl, posted by Philippe Ombredanne, the creator of purl.

Brian, when you say “mapping from purl to purl”, I think you’re talking about my earlier comment about comparing a CVE-purl connection in OSS Index with the same connection in CVE.org (once the CNAs start creating those). That’s a very special case, which I’d prefer to discuss with you offline.

However, “mapping CPE to purl” is literally impossible if there is more than one package manager for a particular OSS project. This is because most CPEs for open source software don’t refer to the package manager (except sometimes as part of the product name), meaning the user has no way of knowing which PM the vulnerability is found in.

Regarding the last sentence, “It’s easy to have multiple valid PURLs all for the same piece of software”, the problem is there’s no way to be certain that the code for a product named “log4core” in one package manager is bit-for-bit identical to the code for the “same” product in another package manager. Given that, the fact that CVE-12345 is found in one PM doesn’t allow you to conclude that it will be found in another PM.

In one sense this is a limitation of purl, since you can’t state that, for example, CVE-12345 applies to all package managers that contain a product called “log4core”. You can only make that statement if you have tested log4core in all package managers. Purl will keep the CNA honest, meaning they will only list a purl in a CVE report if they have tested the product in that package manager, and a user should never assume a CVE in one PM will apply to another. In other words, CPE gives the user a false sense of comprehensiveness.

Somewhere there are / were CPE specifications, likely before NVD took control of it. Early in the VulnDB days, we used them so we could generate our own CPE for products that didn’t appear in NVD. The fact that a seasoned vulnerability practitioner isn’t sure standards exist speaks volumes to how poorly NVD has managed CPE.

As unaccustomed as I am to defending NVD, I need to do so now. There’s simply no way there can be a unique CPE for any product – i.e., one that any user will always be able to create accurately. Pages 7-9 of the OWASP SBOM Forum’s 2022 document on the naming problem differentiate extrinsic identifiers like CPE from intrinsic identifiers like purl.

Briefly, an extrinsic identifier requires the user to do a lookup to at least one external database, before they can be sure they have the correct identifier. In the case of CPE, that database is the CPE Dictionary. On the other hand, an intrinsic identifier like purl just requires the user to enter information they already know with certainty: the package manager from which they downloaded the software, the product name in that package manager, and the version string in that package manager.
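To make the "intrinsic" idea concrete, here is a minimal sketch in Python of assembling a purl from information the user already has. The function name `make_purl` is my own for illustration, and the encoding is simplified: the real purl specification has per-type normalization rules that this sketch does not implement.

```python
from urllib.parse import quote


def make_purl(pkg_type: str, name: str, version: str, namespace: str = "") -> str:
    """Assemble a simplified purl of the form pkg:type/namespace/name@version.

    Illustrative only: type-specific normalization rules from the purl
    specification (e.g. name case folding for some types) are not applied.
    """
    parts = [pkg_type.lower()]
    if namespace:
        # A namespace may have several segments; encode each one separately.
        parts.extend(quote(seg, safe="") for seg in namespace.split("/"))
    parts.append(quote(name, safe=""))
    return "pkg:" + "/".join(parts) + "@" + quote(version, safe="")


print(make_purl("pypi", "django", "4.2.1"))
print(make_purl("maven", "log4j-core", "2.17.1",
                namespace="org.apache.logging.log4j"))
```

No external lookup happens anywhere in the function. Everything it needs is already in the user's hands.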

The reason that CPE is ultimately unworkable is that creating a CPE name usually requires making arbitrary choices (e.g., “version 1.2” vs. “v1.2”), rather than only requiring information that a user can always verify exactly. Nobody can know for sure what choice was made by the person who created the CPE without searching the CPE Dictionary, perhaps several times and perhaps with fuzzy matching.
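A rough illustration of how quickly those arbitrary choices multiply, written as a small Python sketch. The vendor, product and version spellings below are hypothetical, and the `cpe23` helper is only my shorthand for the CPE 2.3 formatted-string layout, with every field after the version wildcarded.

```python
from itertools import product


def cpe23(vendor: str, product_name: str, version: str) -> str:
    """Format an application CPE 2.3 string, leaving later fields as wildcards."""
    return f"cpe:2.3:a:{vendor}:{product_name}:{version}:*:*:*:*:*:*:*"


# Hypothetical but plausible spellings someone might pick for one product.
vendors = ["example", "example_inc"]
products = ["widget", "widget_server"]
versions = ["1.2", "v1.2"]

candidates = [cpe23(v, p, ver) for v, p, ver in product(vendors, products, versions)]

for c in candidates:
    print(c)
print(len(set(candidates)), "distinct candidate names for one product version")
```

Two spellings for each of three fields already produce eight different strings, and only a dictionary search can tell the user which one was actually registered.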

(quoting Tom) “As long as you know the package manager (or source repository) that you downloaded an open source component from, as well as the name and version string in that package manager, you can create a purl that will always let you locate the exact component in a vulnerability database. This is why purl has literally won the battle to be the number one software identifier in vulnerability databases worldwide, and literally the only alternative to CPE.”

Unless… you end up having half a dozen PURLs for the same package, because it is available on a vendor’s page, GitHub, GitLab, Gitee, and every package manager out there.

And this is exactly the point about using purl in a vulnerability database: It only tells you what the CNA that created the CVE report with purl knows: the package manager, product name and version string of the software in which they found the vulnerability. The user can’t draw any conclusion about a product with the same name and version string in any other PM, unless the CNA that produced the report added purls for them as well (meaning they tested the same product and version in each PM). 
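Here is a minimal sketch of what that exact-match behavior looks like, assuming a toy in-memory "database" keyed by purl. The purls are invented for illustration, and CVE-12345 is the same placeholder I used above.

```python
from typing import List, Optional

# Toy data: the CNA tested only the Maven package, so only that purl is listed.
REPORTED = {
    "pkg:maven/org.example/log4core@1.0.0": ["CVE-12345"],
}


def lookup(purl: str) -> Optional[List[str]]:
    """Return the CVEs reported for exactly this purl.

    None means "nobody has made a statement about this purl". It does not
    mean the package is unaffected.
    """
    return REPORTED.get(purl)


print(lookup("pkg:maven/org.example/log4core@1.0.0"))
print(lookup("pkg:npm/log4core@1.0.0"))
```

The second query returns nothing even though the name and version match, because the report says nothing about that package manager.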

Who will maintain this epic list of PURLs? As of this blog, there are only 379 CNAs with tens of thousands of software companies out there. Not to mention the over one hundred million repositories on GitHub alone. While a PURL may be an open standard where CPE is not, it forces the community to set a PURL for every instance of the location of that software. That sounds like the big database you don’t think is viable?

Again, that’s the point of purl: no list is required. Any user can create the correct purl just from the three pieces of information they already know. As Steve Springett often says, every open source product in a package manager already has a purl – there’s no need to create it. 

(quoting Tom) “However, there is one big fly in the purl ointment: It currently doesn’t support proprietary (or “closed source”) software.”

And the other shoe drops. =) So, this is not a critique by any means, just highlighting the problems the community faces. The problems we faced 10 years have just compounded and here we are. Not that there were realistic solutions to all of these problems back then, and even if there were, we certainly didn’t address them then.

That’s correct. Currently, purl only covers open source software, although Steve Springett (who worked with Philippe to create purl, as mentioned in Philippe’s post that I linked above) points out that any online software “store” (Google Play, the Apple Store, etc.) could easily be made into a purl type, since the store controls the namespace of the proprietary products that are for sale in the store (just like a package manager controls the namespace of the packages in the PM).

In other words, what is needed is a controlled namespace, so one product will always have one name. Steve also suggested that SWID tags could be a more general way to identify proprietary software. He wrote the purl PR for a new purl type called SWID, which was adopted in 2022. See below.

(quoting Tom) “I think this is a solvable problem, but it will depend – as a lot of worthwhile practices do – on a lot of people taking a little time every day to solve a problem for everybody. In this case, software suppliers will need to create a SWID tag for every product and version that they produce or that they still support. They might put all of these in a file called SWID.txt at a well-known location on their web site. An API in a user tool, when prompted with the name and version number of the product (which the user presumably has), would go to the site and download the SWID tag – then create the purl based on the contents (there are only about four fields needed for the purl, not the 80 or so in the original SWID spec).”
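As a rough sketch of the last step in that proposal, the Python below pulls a handful of fields out of a SWID tag and builds a purl of type swid. The sample tag, vendor and product names are all hypothetical, the tag is parsed from a string rather than downloaded, and the layout (tag creator name, regid, product name, version, plus a tag_id qualifier) reflects my reading of the swid purl type, so please check it against the purl specification before relying on it.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# XML namespace used by ISO/IEC 19770-2:2015 SWID tags.
NS = {"swid": "http://standards.iso.org/iso/19770/-2/2015/schema.xsd"}

# Hypothetical tag, standing in for one a supplier might publish.
SAMPLE_TAG = """<SoftwareIdentity
    xmlns="http://standards.iso.org/iso/19770/-2/2015/schema.xsd"
    name="ExampleServer" tagId="example-tag-0001" version="1.0.0">
  <Entity name="ExampleVendor" regid="example.com"
          role="tagCreator softwareCreator"/>
</SoftwareIdentity>"""


def swid_to_purl(tag_xml: str) -> str:
    """Build a simplified pkg:swid purl from the few SWID fields it needs."""
    root = ET.fromstring(tag_xml)
    # The Entity whose role list includes tagCreator supplies the namespace.
    creator = next(
        e for e in root.findall("swid:Entity", NS)
        if "tagCreator" in e.get("role", "").split()
    )
    segments = [creator.get("name"), creator.get("regid"), root.get("name")]
    path = "/".join(quote(s, safe="") for s in segments if s)
    version = quote(root.get("version", ""), safe="")
    tag_id = quote(root.get("tagId", ""), safe="")
    return f"pkg:swid/{path}@{version}?tag_id={tag_id}"


print(swid_to_purl(SAMPLE_TAG))
```

Everything else in the tag is ignored, which is the point: the purl needs only a small fraction of what the full SWID format can carry.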

Unfortunately, I think at this point, this is a pipe dream. I am quite literally discovering new, well-known “standards” only by seeing them as requests ending in a 404 response in my web logs. So any such solution based on well-known I think isn’t viable now, and likely won’t be moving forward.

Please read what the OWASP SBOM Forum proposed regarding SWID on pages 11 and 12 of our 2022 paper. The point is that there needs to be some unique user-discoverable source of information on the product. Otherwise, the only alternative is to create (and maintain) a hugely expensive database of all proprietary software, along with the different product names and vendor names it was associated with through its lifetime – and that requires a huge number of very subjective judgments.

For example, if Product A from Vendor X is sold to Vendor Y who renames it Product B, is it the same product or not? If B is very different from A, you would just say it’s different. But if B is literally just A with a different name, you’d say it’s the same. Where do you draw the line between these two cases? There’s simply no way to do so.

There are certainly other ways that information on proprietary software could be made user-discoverable, so that no big secondary database (probably much larger than the vulnerability database itself) is required. One way is Steve Springett’s Common Lifecycle Enumeration project. That will take much longer to put in place than our SWID proposal, but IMO is ultimately the correct thing to do. If you have other ideas, we’d love to hear them. 

(Tom here) Of course, all of the above discussions are examples of the Naming Problem. There’s no question that this problem will be with us for a long time and will never be “solved” in any final way. However, the good news about the Global Vulnerability Database idea is that the naming problem doesn’t need to be solved first, precisely because the GVD won’t require “harmonization” of software names. 

The software will be named what it’s named in the vulnerability databases to which queries are routed; it will be up to the individual databases to continue their (presumably ongoing) efforts to improve their naming. If there's reason to believe there are serious naming problems in one vuln DB, the GVD might suspend routing queries to it. The GVD will be no more accurate than the individual DBs, but it won’t be less accurate, either. 
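To show how little the routing layer would need to know about names, here is a toy sketch in Python. The `QueryRouter` class, the backend names and the idea of choosing a backend by identifier prefix are all my assumptions for illustration. None of this describes an actual GVD design.

```python
from typing import Callable, Dict, List, Set, Tuple

Lookup = Callable[[str], List[str]]


class QueryRouter:
    """Pass each identifier, unchanged, to every active backend that accepts it."""

    def __init__(self) -> None:
        self.backends: Dict[str, Tuple[str, Lookup]] = {}
        self.suspended: Set[str] = set()

    def register(self, name: str, prefix: str, lookup: Lookup) -> None:
        self.backends[name] = (prefix, lookup)

    def suspend(self, name: str) -> None:
        # Stop routing to a database with suspected naming problems.
        self.suspended.add(name)

    def query(self, identifier: str) -> Dict[str, List[str]]:
        results: Dict[str, List[str]] = {}
        for name, (prefix, lookup) in self.backends.items():
            if name in self.suspended or not identifier.startswith(prefix):
                continue
            # No renaming or harmonization: the backend sees the raw identifier.
            results[name] = lookup(identifier)
        return results


router = QueryRouter()
# Hypothetical backends with canned answers.
router.register("purl_db", "pkg:", lambda ident: ["CVE-12345"])
router.register("cpe_db", "cpe:", lambda ident: [])

print(router.query("pkg:maven/org.example/log4core@1.0.0"))
router.suspend("purl_db")
print(router.query("pkg:maven/org.example/log4core@1.0.0"))
```

The router never rewrites an identifier, so any naming quality, good or bad, comes entirely from the database that answers.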

Any opinions expressed in this blog post are strictly mine and are not necessarily shared by any of the clients of Tom Alrich LLC. If you would like to comment on what you have read here, I would love to hear from you. Please email me at tom@tomalrich.com. Also, if you would like to learn more about or join the OWASP SBOM Forum, please email me.

My book "Introduction to SBOM and VEX" is now available in paperback and Kindle versions! For background on the book and the link to order it, see this post.

 
