In September, I announced
that the SBOM Forum, the informal discussion group I started last March, had published
a “proposal”
to address the naming problem. This problem is one of the most important obstacles
currently limiting use of software bills of materials (SBOMs) by end user organizations
(although I pointed out, as I do frequently, that software developers are already
using SBOMs heavily to learn about vulnerabilities in the software they’re
building, which shows the benefits are real).
In this
recent post, I described the problem, which centers on the CPE identifier that’s
the basis for the National Vulnerability Database (NVD), using excerpts from our
proposal. While there are six particular causes of the problem, their result is
the same: Only a small percentage - probably under five percent - of component
names found in an SBOM can be located through a search of the NVD, even using a
properly constructed CPE name. To find more than this small percentage of
names, a developer needs to use various heuristic schemes like AI and fuzzy
logic.
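To give a feel for what such a heuristic looks like, here is a minimal sketch of fuzzy string matching using Python's standard `difflib` module. It is only an illustration, not how any particular tool works: the `cpe_products` list and the `closest_products` helper are made up for this example, and the list merely stands in for product names that might appear in CPE entries.

```python
import difflib

# Hypothetical product names, standing in for the product field of CPE names.
cpe_products = ["log4j", "openssl", "spring_framework", "jackson-databind"]


def closest_products(component_name, candidates, cutoff=0.6):
    """Return up to three candidates that resemble component_name, best first."""
    # Light normalization before comparing: lowercase, spaces to underscores.
    normalized = component_name.lower().replace(" ", "_")
    return difflib.get_close_matches(normalized, candidates, n=3, cutoff=cutoff)


# The component name as it might appear in an SBOM.
sbom_name = "Log4j-core"

# An exact search finds nothing...
exact_hit = sbom_name.lower() in cpe_products

# ...but a similarity search suggests a likely candidate.
matches = closest_products(sbom_name, cpe_products)
print(exact_hit, matches)
```

The match is only a guess. Someone still has to confirm that the suggested name really refers to the same product, which is where the manual effort comes from.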
The good news is that, by all
reports I’ve heard, it’s possible to find a respectable percentage of component
names in the NVD using these schemes. But the bad news is that, until at least
a partial solution is implemented, producing SBOMs is always going to require a
substantial “manual” effort. (Ours is the only proposal I’ve heard about,
although I know CISA will be soliciting other proposals as well, which is
something a government agency has to do.) The process can’t be fully automated
until the component names it produces can, in principle, always be found in a
vulnerability database like the NVD or OSS Index.
How does our proposal address this
problem? It proposes that the NVD start accepting other identifiers for
software and hardware besides CPEs, although there is no need to replace CPEs:
existing CPEs can be retained and new CPEs can be created. I’ll discuss our
software naming proposal in this post, and our hardware naming proposal in a
future post.
For software, we’re proposing that
the NVD accept an additional identifier called purl, which stands for “package
URL”. While there is no way to know for sure, we believe that accommodating purl
will address 70-80 percent of the software naming problem in the NVD. Purl is already
in widespread use in the open source community; it is the identifier used in
Sonatype’s OSS Index and most
other open source databases.
Purl was developed primarily to
address the problem that the “same” open source component can often be found in
multiple repositories (package managers), and the code may be somewhat
different in each one. Additionally, the same component might be found with
different names in different package managers. Purl provides a name that is
tied to the download location. If the user knows where they downloaded the
component from and its name in that repository, they should always be able to
create a purl that matches the one the component supplier (which is often an
open source community, of course) used when reporting a vulnerability for the
component. Thus, they should always be able to find the vulnerabilities
reported for that component. That certainly can’t be said for CPE, or any other
identifier that has to rely on a centrally maintained database.
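To show how a purl ties a name to a download location, here is a rough sketch that splits a purl into its parts. The general layout, `pkg:type/namespace/name@version?qualifiers#subpath`, comes from the purl specification, and the sample purls are ones the specification itself uses as examples. The `parse_purl` function is my own simplified illustration. It skips the type-specific rules in the spec, so please don't treat it as a conforming parser.

```python
from urllib.parse import unquote


def parse_purl(purl):
    """Split a purl into its main parts (simplified; not a full spec parser)."""
    scheme, _, rest = purl.partition(":")
    if scheme != "pkg":
        raise ValueError("a purl must start with 'pkg:'")
    rest, _, _subpath = rest.partition("#")      # optional subpath, ignored here
    rest, _, qualifiers = rest.partition("?")    # optional key=value qualifiers
    version = ""
    if "@" in rest:
        rest, version = rest.rsplit("@", 1)      # optional version
    segments = rest.strip("/").split("/")
    if len(segments) < 2:
        raise ValueError("a purl needs at least a type and a name")
    return {
        # The type names the ecosystem or package manager (pypi, maven, deb...).
        "type": segments[0].lower(),
        "namespace": "/".join(unquote(s) for s in segments[1:-1]) or None,
        "name": unquote(segments[-1]),
        "version": unquote(version) or None,
        "qualifiers": dict(q.split("=", 1) for q in qualifiers.split("&"))
        if qualifiers else {},
    }


django = parse_purl("pkg:pypi/django@1.11.1")
curl = parse_purl("pkg:deb/debian/curl@7.50.3-1?arch=i386&distro=jessie")
print(django["type"], django["name"], django["version"])
print(curl["type"], curl["namespace"], curl["name"], curl["qualifiers"])
```

The first segment after `pkg:` is what ties the name to the place the component came from. A component with the same name in a different package manager gets a different purl.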
Since about 90% of software components
are open source, purl was the natural choice to be the basis for the SBOM Forum’s
proposal. However, as you’ll see if you read our whole proposal, we have utilized
the unique properties of purl to incorporate names for proprietary (or “closed
source”) software into the proposal as well, utilizing SWID tags (I plan
to discuss this in a post in the near future).
Purl is unique among identifiers
in that it doesn’t require a centralized database. How is that possible? Here’s
how we answer that question in our proposal:
Our
solution is based on…the distinction between intrinsic and extrinsic
identifiers. As described in an excellent article[1], extrinsic identifiers “use
a register to keep the correspondence between the identifier and the object”,
meaning that what binds the identifier to what it identifies is an entry in a
central register. The only way to learn what an extrinsic identifier refers to
is to make a query to the register; the identifier itself carries no
information about its object.
Social Security numbers are a paradigmatic example of an extrinsic identifier.
These are maintained in a single registry (presumably duplicated for resiliency
purposes) by the Social Security Administration. When a baby is born or an
immigrant is given permission to work in
the US, their number is assigned to them. The only way the person “behind” a
Social Security number can be identified is by making a query to the central
registry (which of course is not normally permitted) or by hacking into the
registry.
By
contrast, intrinsic identifiers “are intimately bound to the designated
object”. They don’t need a register; the object itself provides all the
information needed to create a unique identifier. What intrinsic identifiers need is an agreed-on
standard for how that information will be represented. Once a standard is
agreed on, anyone who has knowledge of the object can create an identifier that
will be recognizable to anyone in the world, as long as both creator and user
of the identifier follow the same standard.
An
example of an intrinsic identifier is the name of a simple chemical compound.
As the article states, “We learned in high school that we do not need a
register that attributes different identifiers to all possible chemical
compounds. It’s enough to learn once and for all the standard nomenclature[2],
which ensures that a spoken or written chemical name leaves no ambiguity
concerning which chemical compound the name refers to. For example, the formula
for table salt is written NaCl, and read sodium
chloride.”
In
other words, simply knowing information about the makeup of a simple chemical
compound is sufficient to create a name for the compound, which will be
understandable by chemists that speak any language - as long as you follow the
standard when you create the name. NaCl refers to table salt, no matter where
you are or what language you speak[3].
An
example of an extrinsic identifier
that is very pertinent to our proposal is a CPE name. CPE names are assigned by
a central authority (NIST) and stored in a register. Whenever a user searches
for a CPE name in the National Vulnerability Database, the register is searched
to determine which entry the name refers
to.
Only
CVE Numbering Authorities (CNA) authorized by NIST may create CPE names;
currently, there are around 200 CNAs, mostly software suppliers or
cybersecurity service providers. Organizations that wish to report a CVE
(vulnerability), but do not themselves have a CNA on staff and are unable to
identify an appropriate CNA to do this, may submit their requests to the MITRE
Corporation “CNA of Last Resort”, which will identify an appropriate CNA to
process the request.
In
most cases, a software supplier applies to NIST for a CPE name for a product.
When the name is assigned, the supplier can report vulnerabilities (each of
which has a CVE name, assigned by CVE.org[4]) that apply to the product.
Even though software products are named in many different ways as they are
being developed and distributed, the product name used in the CPE is based on a
specification[5] that
has no inherent connection to the product itself.
The
solution we are proposing for the software naming problem is based on an
intrinsic identifier called purl[6], a
contraction of “package URL”. As in the case of a simple chemical compound,
knowing certain publicly available attributes of a software product enables
anyone to construct the correct purl for the product. Moreover, anyone else who
knows the same attributes will be able to construct exactly the same purl, and
therefore find the product in a vulnerability database without having to query
any central name registry (which of course does not exist for purl, since it is
not needed).
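Here is a small sketch of that idea: a supplier and a user each build a purl from the same public attributes, and the user's lookup succeeds without any name registry. Everything in it is illustrative. `build_purl` is a simplified helper of my own (it ignores qualifiers, subpaths and the type-specific normalization rules in the purl spec), `vuln_db` is a plain dictionary standing in for a vulnerability database, and the CVE ID is a placeholder, not a real vulnerability.

```python
from urllib.parse import quote


def build_purl(pkg_type, name, version, namespace=None):
    """Build a basic purl from publicly known attributes (simplified sketch)."""
    parts = [pkg_type.lower()]  # the type is case-insensitive, so lowercase it
    if namespace:
        parts.extend(quote(seg, safe="") for seg in namespace.split("/"))
    parts.append(quote(name, safe=""))
    return "pkg:" + "/".join(parts) + "@" + quote(version, safe="")


# Hypothetical vulnerability database keyed by purl.
vuln_db = {}

# The supplier reports a vulnerability against the purl it constructs.
supplier_purl = build_purl(
    "maven", "batik-anim", "1.9.1", namespace="org.apache.xmlgraphics"
)
vuln_db.setdefault(supplier_purl, []).append("CVE-0000-00000")  # placeholder ID

# The user knows the same attributes and builds the purl independently.
user_purl = build_purl(
    "Maven", "batik-anim", "1.9.1", namespace="org.apache.xmlgraphics"
)
found = vuln_db.get(user_purl, [])

print(user_purl)  # pkg:maven/org.apache.xmlgraphics/batik-anim@1.9.1
print(found)      # ['CVE-0000-00000']
```

The lookup is an exact string match. Neither party asked a central authority what the component is called.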
Purl
was originally developed to solve a specific problem: A software product will
have different names, depending on the programming language, package manager,
packaging convention, tool, API or database in which it is found.[7]
Before purl was developed, if someone familiar with, for example, a specific
package manager (essentially, a distribution point for software) wanted to talk
about a specific open source product with someone familiar with a different
package manager, the first person would need to learn the name of that product
in the second package manager, assuming it could be found there. This would always
be hard, because – of course – there is no common name to refer to.
This
situation is analogous to the case in which an English speaker wishes to
discuss avocados with someone who speaks both Swahili and English. The English
speaker doesn’t know the Swahili word for “avocado”. To find that word, the
English speaker uses an English/Swahili dictionary to learn that the Swahili
word for avocado is parachichi.
Neither avocado nor parachichi has
any connection to an actual avocado – they are simply names. They are extrinsic
identifiers, and there is no way to know that they refer to the same thing
without a central register, which in this case is the dictionary.
However,
what if the two speakers didn’t have ready access to an English/Swahili
dictionary, either in hard copy or online? Most likely, the English speaker
would find a picture of an avocado and show it to the Swahili speaker. The
latter would smile broadly and say they now understand exactly what an avocado
is. In fact, even if a dictionary were available, it would probably be easier
for the English speaker to show the picture to the Swahili speaker. The picture
is an intrinsic identifier; it is based on an attribute of the thing identified
– in this case, what the thing looks like. Even if there were no
English/Swahili dictionaries in existence, the picture would be a perfectly
acceptable (in fact, preferable) identifier.
For a more technical description of purl by Philippe Ombredanne, the creator of purl, go here.
If you want to know how we
envision purl working in the NVD, go to the discussion starting on page 10 of our
proposal. In future posts, I plan to discuss how our proposal addresses
proprietary software and intelligent hardware devices.
Any opinions expressed in this blog post are strictly mine and are not necessarily shared by any of the clients of Tom Alrich LLC. If you would like to comment on what you have read here, I would love to hear from you. Please email me at tom@tomalrich.com.
[3] More complex compounds may require a CAS Registry
Number, which relies on a centralized database.
[7] The reader does not need to understand what all these
items are, in order to understand the principle behind purl. In principle,
nothing would be different if each of these items were a language, so the reader
might think of the items simply as different languages.