Monday, November 21, 2022

The purl in your future


In September, I announced that the SBOM Forum, the informal discussion group I started last March, had published a “proposal” to address the naming problem. This problem is one of the most important obstacles currently limiting use of software bills of materials (SBOMs) by end user organizations (although I pointed out, as I do frequently, that software developers are already using SBOMs heavily to learn about vulnerabilities in the software they’re building. This shows the benefits are real).

In this recent post, I described the problem, which centers on the CPE identifier that’s the basis for the National Vulnerability Database (NVD), using excerpts from our proposal. While there are six particular causes of the problem, their result is the same: Only a small percentage - probably under five percent - of component names found in an SBOM can be located through a search of the NVD, even using a properly-constructed CPE name. If a developer wishes to find more than this small percentage of names, they need to utilize various heuristic schemes like AI and fuzzy logic to increase the percentage.

The good news is that, by all reports I’ve heard, it’s possible to find a respectable percentage of component names in the NVD using these schemes. But the bad news is that having to do this means that, until at least a partial solution is implemented (and ours is the only proposal I’ve heard about, although I know CISA will be soliciting other proposals as well - something a government agency has to do), producing SBOMs is always going to require a substantial “manual” effort; this process can’t be fully automated until the component names produced by the automated process can in principle always be found in a vulnerability database like the NVD or OSS Index.

How does our proposal address this problem? It proposes that the NVD start accepting other identifiers for software and hardware besides CPEs – although there is no need to replace CPEs. Existing CPEs can be retained and new CPEs can be created. We are proposing that the NVD accept new identifiers for software and hardware. I’ll discuss our software naming proposal in this post, and our hardware naming proposal in a future post.

For software, we’re proposing that the NVD accept an additional identifier called purl, which stands for “package url”. While there is no way to know for sure, we believe that accommodating purl will address 70-80 percent of the software naming problem in the NVD. Purl is already in widespread use in the open source community; it is the identifier used in Sonatype’s OSS Index and most other open source databases.

Purl was developed primarily to address the problem that the “same” open source component can often be found in multiple repositories (package managers), and the code may be somewhat different in each one. Additionally, the same component might be found with different names in different package managers. Purl provides a name that is tied to the download location; if the user knows where they downloaded the component from and the name in that repository, they should always be able to create a purl that matches the one used by the component supplier (which is often an open source community, of course) when they reported a vulnerability for the component; thus, they should always be able to find the vulnerabilities reported for that component. That certainly can’t be said for CPE, or any other identifier that has to rely on a centrally maintained database.

Since 90% of software components are open source, purl was the natural choice to be the basis for the SBOM Forum’s proposal. However, as you’ll see if you read our whole proposal, we have utilized the unique properties of purl to incorporate names for proprietary (or “closed source") software into the proposal as well, utilizing SWID tags (I plan to discuss this in a post in the near future).

Purl is unique among identifiers in that it doesn’t require a centralized database. How is that possible? Here’s how we answer that question in our proposal:

Our solution is based on…the distinction between intrinsic and extrinsic identifiers. As described in an excellent article[1], extrinsic identifiers “use a register to keep the correspondence between the identifier and the object”, meaning that what binds the identifier to what it identifies is an entry in a central register. The only way to learn what an extrinsic identifier refers to is to make a query to the register; the identifier itself carries no information about its object.

A paradigmatic example of an extrinsic identifier is Social Security numbers. These are maintained in a single registry (presumably duplicated for resiliency purposes) by the Social Security Administration. When a baby is born or an immigrant is given permission to work  in the US, their number is assigned to them. The only way the person “behind” a Social Security number can be identified is by making a query to the central registry (which of course is not normally permitted) or by hacking into the registry.

By contrast, intrinsic identifiers “are intimately bound to the designated object”. They don’t need a register; the object itself provides all the information needed to create a unique identifier. What  intrinsic identifiers need is an agreed-on standard for how that information will be represented. Once a standard is agreed on, anyone who has knowledge of the object can create an identifier that will be recognizable to anyone in the world, as long as both creator and user of the identifier follow the same standard.

An example of an intrinsic identifier is the name of a simple chemical compound. As the article states, “We learned in high school that we do not need a register that attributes different identifiers to all possible chemical compounds. It’s enough to learn once and for all the standard nomenclature[2], which ensures that a spoken or written chemical name leaves no ambiguity concerning which chemical compound the name refers to. For example, the formula for table salt is written NaCl, and read sodium chloride.”

In other words, simply knowing information about the makeup of a simple chemical compound is sufficient to create a name for the compound, which will be understandable by chemists that speak any language - as long as you follow the standard when you create the name. NaCl refers to table salt, no matter where you are or what language you speak[3].

An example of an extrinsic identifier that is very pertinent to our proposal is a CPE name. CPE names are assigned by a central authority (NIST) and stored in a register. Whenever a user searches for a CPE name in the National Vulnerability Database, the register is searched to determine which entry  the name refers to.

Only CVE Numbering Authorities (CNA) authorized by NIST may create CPE names; currently, there are around 200 CNAs, mostly software suppliers or cybersecurity service providers. Organizations that wish to report a CVE (vulnerability), but do not themselves have a CNA on staff and are unable to identify an appropriate CNA to do this, may submit their requests to the MITRE Corporation “CNA of Last Resort”, which will identify an appropriate CNA to process the request.

In most cases, a software supplier applies to NIST for a CPE name for a product. When the name is assigned, the supplier can report vulnerabilities (each of which has a CVE name, assigned by CVE.org[4]) that apply to the product. Even though software products are named in many different ways as they are being developed and distributed, the product name used in the CPE is based on a specification[5] that has no inherent connection to the product itself.

The solution we are proposing for the software naming problem is based on an intrinsic identifier called purl[6], a contraction of “package URL”. As in the case of a simple chemical compound, knowing certain publicly available attributes of a software product enables anyone to construct the correct purl for the product. Moreover, anyone else who knows the same attributes will be able to construct exactly the same purl, and therefore find the product in a vulnerability database without having to query any central name registry (which of course does not exist for purl, since it is not needed).

Purl was originally developed to solve a specific problem: A software product will have different names, depending on the programming language, package manager, packaging convention, tool, API or database in which it is found.[7] Before purl was developed, if someone familiar with, for example, a specific package manager (essentially, a distribution point for software) wanted to talk about a specific open source product with someone familiar with a different package manager, the first person would need to learn the name of that product in the second package manager, assuming it could be found there. This would always be hard, because – of course – there is no common name to refer to.

This situation is analogous to the case in which an English speaker wishes to discuss avocados with someone who speaks both Swahili and English. The English speaker doesn’t know the Swahili word for “avocado”. To find that word, the English speaker uses an English/Swahili dictionary to learn that the Swahili word for avocado is parachichi. Neither avocado nor parachichi has any connection to an actual avocado – they are simply names. They are extrinsic identifiers, and there is no way to know that they refer to the same thing without a central register, which in this case is the dictionary.

However, what if the two speakers didn’t have ready access to an English/Swahili dictionary, either in hard copy or online? Most likely, the English speaker would find a picture of an avocado and show it to the Swahili speaker. The latter would smile broadly and say they now understand exactly what an avocado is. In fact, even if a dictionary were available, it would probably be easier for the English speaker to show the picture to the Swahili speaker. The picture is an intrinsic identifier; it is based on an attribute of the thing identified – in this case, what the thing looks like. Even if there were no English/Swahili dictionaries in existence, the picture would be a perfectly acceptable (in fact, preferable) identifier.

If you want to know how we envision purl working in the NVD, go to the discussion starting on page 10 of our proposal. In future posts, I plan to discuss how our proposal addresses proprietary software and intelligent hardware devices.

Any opinions expressed in this blog post are strictly mine and are not necessarily shared by any of the clients of Tom Alrich LLC. If you would like to comment on what you have read here, I would love to hear from you. Please email me at tom@tomalrich.com.


[3] More complex compounds may require a CAS Registry Number, which is a centralized database.

[7] The reader does not need to understand what all these items are, in order to understand the principle behind purl. In principle, nothing would be different if each of these items were a language, so the reader might think of the items simply as different languages.

No comments:

Post a Comment