I recently described
my idea for a Global Vulnerability Database. The GVD won’t be a database at
all, but rather an “intelligent switching hub” that accepts vulnerability queries
that are in the form:
“What Vulnerabilities are
found in Product ABC?”, or
“What Products are
affected by Vulnerability 123?”
The Product and Vulnerability fields are both intended to be
as universal as possible; that is, they should accept all major
machine-readable identifiers. For example, the Vulnerability field will accept
CVE, OSV, GHSA (GitHub Security Advisory), and other vulnerability identifiers.
The Product field will accept CPE, purl, OSV, and perhaps other product
identifiers.
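As a rough illustration of what "accepting all major identifiers" could mean in practice, here is a minimal sketch of how a front end might recognize which kind of identifier it has been given. The patterns and the `classify` function are assumptions made for this example, not part of any GVD design; real identifier formats have more variants than these patterns cover, and OSV identifiers are omitted because their format depends on the source database.

```python
import re

# Hypothetical, simplified patterns for identifier families named in the post.
VULN_PATTERNS = {
    "CVE": re.compile(r"^CVE-\d{4}-\d{4,}$", re.IGNORECASE),
    "GHSA": re.compile(r"^GHSA(-[0-9a-z]{4}){3}$", re.IGNORECASE),
}
PRODUCT_PATTERNS = {
    "CPE": re.compile(r"^cpe:2\.3:[aho]:"),
    "purl": re.compile(r"^pkg:[a-z][a-z0-9.+-]*/"),
}


def classify(identifier: str) -> tuple[str, str]:
    """Return (field, identifier_type), e.g. ("vulnerability", "CVE")."""
    text = identifier.strip()
    for name, pattern in VULN_PATTERNS.items():
        if pattern.match(text):
            return ("vulnerability", name)
    for name, pattern in PRODUCT_PATTERNS.items():
        if pattern.match(text):
            return ("product", name)
    # Free text such as "Django 5.2" is not a machine-readable identifier.
    return ("unknown", "unknown")
```

An input that matches none of the patterns is reported as unknown rather than guessed at, so the caller can decide what to do with it.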
While this was not always the case, it is safe to assume
that every major vulnerability database today accepts and/or
outputs machine-readable vulnerability identifiers, product identifiers, or
both. However, in this regard there are two important differences between the
GVD and other vulnerability databases:
1. With one notable exception[i],
it is unlikely there is any vulnerability database today that, in response to a
query for vulnerabilities that affect Product ABC, will provide more than one
type of vulnerability identifier - for example, both CVE and GHSA. Moreover,
with the same exception, it is unlikely there is any vulnerability database
today that, in response to a query for products that are affected by a
particular vulnerability (e.g., CVE-2025-12345), will provide more than one
type of product identifier, e.g. purl and CPE. This is because most
vulnerability databases are designed to associate a single type of product
identifier with a single type of vulnerability identifier. For example, the NVD only associates CPE names for products with
CVE numbers for vulnerabilities; the OSS
Index open source database only associates purl identifiers with CVE
numbers; etc.
2. It is also safe to say there is no vulnerability
database today that will respond to a query like “Show me vulnerabilities of
all types that affect Product ABC”, by displaying all major types of
vulnerability identifiers. It’s also safe to say there’s no vulnerability
database today that will respond to a query like, “Show me products of all
types that are affected by CVE-2025-12345”, by displaying all major types of
product identifiers. Yet, my ambition is that the GVD will do both of those
things.
However, there is a potential fly in this ointment: There is
no way to create an unambiguous mapping either between different types of
vulnerability identifiers (e.g., CVE to OSV) or different types of product
identifiers (e.g., CPE to purl). Here are several examples:
A. Most vulnerabilities are assigned to products as
part of a coordinated vulnerability disclosure process. For example, an open
source project (“Project 1”) might report a new vulnerability they have
identified in their product to the CVE Program. A CVE Numbering Authority (CNA)
will create a new CVE record for the vulnerability and assign it a CVE number
like CVE-2024-56789. If the project team also registers the new vulnerability
with GitHub, it will receive a GHSA identifier as well. Given that the same
team is responsible for both registrations for the vulnerability (CVE and GHSA),
the two registrations will usually be considered to identify the same
vulnerability.
B. However, if a separate open source project registers
a similar vulnerability as a GHSA and asserts it is the same as the vulnerability
described in CVE-2024-56789, this assertion may meet with skepticism in the CVE
Program, since the two registrations were not by the same team. Since there is
no easy way to resolve a dispute like this, the only safe policy is to accept
two registrations as being for the same vulnerability only if they were both
created by the same organization or person. If that is not the case, the two
registrations need to be considered different vulnerabilities.
C. Libraries are widely used by both open source and
commercial developers. Usually, a vulnerability will be present in just one
module of a library, not all of them. However, since CPE names identify the
product that contains the vulnerability and the library itself is the product,
this means a CPE name will not usually refer to the vulnerable module[ii].
By contrast, purl (“package URL”) identifies a package.
Since each module of a library is its own package, this makes it possible to
identify the location of a vulnerability with much more precision.[iii]
Thus, there can be no CPE “equivalent” of a purl that references a single
library module.
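The conservative policy in example B (treat two registrations as the same vulnerability only if the same organization or person created both) can be written as a very small rule. This is only a sketch: the `Registration` class and `treat_as_same` function are hypothetical names, and comparing registrant names as strings is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Registration:
    identifier: str  # e.g. "CVE-2024-56789" or a GHSA identifier
    registrant: str  # organization or person that created the record


def treat_as_same(a: Registration, b: Registration) -> bool:
    """Example B's policy: same registrant, or else different vulnerabilities."""
    # Assumption: registrants are matched by normalized name only.
    return a.registrant.strip().lower() == b.registrant.strip().lower()


cve = Registration("CVE-2024-56789", "Project 1")
own_ghsa = Registration("GHSA-xxxx-xxxx-xxxx", "Project 1")    # placeholder ID
other_ghsa = Registration("GHSA-yyyy-yyyy-yyyy", "Project 2")  # placeholder ID
```

Under this rule, `cve` and `own_ghsa` are linked, while `cve` and `other_ghsa` remain separate vulnerabilities even if Project 2 asserts they are the same.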
The primary lesson to be drawn from
the above examples is that, because there are so many reasons why one type of vulnerability
or product identifier will not be “translatable” to another type, it would be a
bad idea to try to “harmonize” the identifiers into one type – for instance,
make purl the “universal” product identifier or CVE the “universal”
vulnerability identifier, with all other identifiers “translated” to one or the
other. On the other hand, if it might benefit a vulnerability database user to
learn about a vulnerability or vulnerable product that is like the one included
in the response to their query, the GVD will usually provide both the exact and
the similar match.
This means that, even though the
user will usually enter a straightforward query that lists just one or two
product or vulnerability identifiers, the response will not necessarily be
limited to the same identifiers. The GVD will always assume that the user is
interested in seeing as much relevant information as possible, even if they end
up discarding some of what they are shown.[iv]
Here are two examples of how a
single query might work:
Query 1: “What current vulnerabilities have been identified in the
open source project Django version 5.2?”
The query is parsed into three queries
to three vulnerability databases:
· To the NVD: “What vulnerabilities affect Django version 5.2?” The response to this query is this list of four CVE numbers. Each of those can be queried separately for more information on the vulnerability.
· To GitHub Advisory Database (GAD): “What vulnerabilities affect Django version 5.2?” The response to this query is this list of two CVE numbers, which are both included in the NVD response. The first of the two CVEs corresponds to the GitHub ID GHSA-7xr5-9hcq-chf9, which can be searched on separately. The second CVE corresponds to GHSA-8j24-cjrq-gr2m, which can also be searched on separately.
· To Sonatype OSS Index: “What vulnerabilities apply to purl pkg:pypi/django@5.2?”[v] The response to this query is this list of two CVEs. These are the same CVEs shown by the GitHub Advisory Database. However, clicking on either of the CVE lines provides additional information not provided by either the NVD or GAD.
All three results will be provided
to the user, as well as results from queries to any other vulnerability
database, such as OSV, if different results are obtained. Note that, while the
NVD and GAD queries are identical, the OSS Index query uses the purl for Django
v5.2.[vi]
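The fan-out in Query 1 might be sketched as follows. Everything here is an assumption for illustration: the backends are stand-in functions that return placeholder IDs (not real CVE numbers), and a real front end would call each database's own interface instead. The point is only the shape of the process: one question goes to several sources, each answer is kept, and the user can see which sources reported each vulnerability.

```python
from typing import Callable

# A backend takes (product name, version) and returns vulnerability IDs.
Backend = Callable[[str, str], list[str]]


def fan_out(name: str, version: str,
            backends: dict[str, Backend]) -> dict[str, list[str]]:
    """Send the same question to every backend and keep each answer separate."""
    return {source: backend(name, version) for source, backend in backends.items()}


def sources_by_vulnerability(responses: dict[str, list[str]]) -> dict[str, list[str]]:
    """Invert the responses: for each vulnerability ID, list who reported it."""
    merged: dict[str, list[str]] = {}
    for source, ids in responses.items():
        for vuln_id in ids:
            merged.setdefault(vuln_id, []).append(source)
    return merged


# Stand-in backends with placeholder data mirroring the counts in Query 1.
backends = {
    "NVD": lambda name, version: ["VULN-1", "VULN-2", "VULN-3", "VULN-4"],
    "GAD": lambda name, version: ["VULN-1", "VULN-2"],
    "OSS Index": lambda name, version: ["VULN-1", "VULN-2"],
}
responses = fan_out("django", "5.2", backends)
merged = sources_by_vulnerability(responses)
```

Keeping the per-source responses alongside the merged view means nothing is hidden from the user: a vulnerability reported by only one database is still shown, with its source attached.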
Query 2: “What products are affected by CVE-2021-45046?”
The query is parsed into two
queries to two vulnerability databases:
· To the NVD: “What products are affected by CVE-2021-45046?” The response to this query identifies twelve “Known affected software configurations”, which among them list over 50 CPE names.
· To GitHub Advisory Database: “What products are affected by CVE-2021-45046?” The response to this query illustrates the fact that there is not always a list of machine-readable software identifiers available. The primary feature of this page is the set of references – security advisories by various developers and manufacturers, including patch URLs. These references need to be parsed “manually”.
Of course, even though the response from the NVD includes machine-readable software identifiers and the response from the GAD does not, that doesn’t mean the two responses should not be displayed together. Both responses provide a set of references; it is unlikely that the two sets are identical. Since most queries about CVE-2021-45046 are probably motivated by a search for a patch (this is one of the vulnerabilities associated with log4shell in the log4j library), users will want to see as many references as possible.
The moral of this story is that a
query to the Global Vulnerability Database will usually yield multiple
responses. These will include:
1. Responses from databases other than the one originally intended in the query, as well as
2. Responses generated from queries using identifiers that are similar to, but not the same as, the identifier used in the query.
Of course, the additional queries will not be generated by some mechanistic process, but rather by an intelligent process that will run in the “front end” of the GVD. Does this mean that the front end will run a large language model created by generative AI? No. My opinion (which I’ll be glad to discuss with anybody who thinks differently) is that the decisions on alternative queries in the GVD need to be based on a set of identifiable rules that can be audited.[vii]
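To make "identifiable rules that can be audited" a little more concrete, here is a minimal sketch of how such a front end might work. All of the names (`Rule`, `swap_purl_type`, `generate_queries`) are hypothetical, and the choice of "conda" as an alternative purl type is purely illustrative. Each alternative query carries the ID of the rule that produced it, so anyone reviewing the output can trace why a query was run.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Rule:
    rule_id: str
    description: str
    apply: Callable[[str], Optional[str]]  # alternative query, or None


def swap_purl_type(new_type: str) -> Callable[[str], Optional[str]]:
    """Build a rule body that re-targets a purl at another repository type."""
    def apply(query: str) -> Optional[str]:
        if not query.startswith("pkg:") or "/" not in query:
            return None  # rule does not apply to non-purl queries
        old_type, rest = query[len("pkg:"):].split("/", 1)
        if old_type == new_type:
            return None
        return f"pkg:{new_type}/{rest}"
    return apply


def generate_queries(query: str, rules: list[Rule]) -> list[dict]:
    """Return the original query plus alternatives, each tagged with its rule."""
    plan = [{"query": query, "match": "exact", "rule": None}]
    for rule in rules:
        alternative = rule.apply(query)
        if alternative is not None:
            plan.append({"query": alternative, "match": "similar",
                         "rule": rule.rule_id})
    return plan


RULES = [Rule("R1", "Also try the conda purl type", swap_purl_type("conda"))]
plan = generate_queries("pkg:pypi/django@5.2", RULES)
# A user who wants only exact matches can filter the plan.
exact_only = [entry for entry in plan if entry["match"] == "exact"]
```

Because the rules are plain data with stable IDs, adding, changing, or backing out a rule is a visible, reviewable change, which is the property a model-driven front end would not offer.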
If you would like to comment on what you have read here, I would love to hear from you. Please email me at tom@tomalrich.com. And please donate as well!
[ii]
In some cases, the person who creates the CPE name creates a “product name”
that includes the names of both the library and the vulnerable module. However,
there is no consistent procedure for doing this, so it cannot be used for an
automated response.
[iii]
Software developers often do not install library modules that are not
directly used by their product. As a result, a lot of patches for libraries are
issued and applied needlessly, since the vulnerable module was never included
in the product in the first place. This was the case with the log4shell
vulnerability in the log4j library.
Log4shell affected just the log4j-core module, meaning any
developer that had not installed that module didn’t need to patch the library.
However, since vulnerability advisories that referred to the CPE name (and thus
only designated the log4j library as vulnerable, not the log4j-core module) didn’t
capture this subtlety, many developers probably patched needlessly.
[iv] Since
some users will not be interested in seeing close matches, a GVD user will be
able to suppress display of any match except an exact one. In that case, the
output they receive will be close to what they will receive from a search on a
single database.
[v] A
purl can be easily created using a simple formula and information that a user
should have readily available (or else be able to find quickly). In this case,
the user just needs to know the package name, version number, and the repository
from which they downloaded the package. The repository (known as the purl
“type”) is PyPI, which stands for Python Package Index.
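The "simple formula" can be sketched in a few lines. This is a simplified illustration, not a full implementation of the purl specification: it ignores the optional namespace, qualifiers, and subpath components, and `make_purl` is a hypothetical helper name.

```python
from urllib.parse import quote


def make_purl(purl_type: str, name: str, version: str) -> str:
    """Build a basic purl of the form pkg:type/name@version."""
    purl_type = purl_type.lower()
    if purl_type == "pypi":
        # The purl spec normalizes PyPI names: lowercase, "_" becomes "-".
        name = name.lower().replace("_", "-")
    # Percent-encode any characters that are not safe in a purl component.
    return f"pkg:{purl_type}/{quote(name, safe='')}@{quote(version, safe='')}"
```

For example, `make_purl("PyPI", "Django", "5.2")` produces the purl used in Query 1.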
[vi] Every purl has a “type” that usually indicates the repository from which the software was downloaded. The purl in this example has the type “pypi”, which refers to PyPI, the Python Package Index. If Django is not available in other repositories than PyPI, this means there is only one possible purl to use in a search for Django in OSS Index. However, if Django were available in other repositories (e.g. package managers), each of those could be used for a separate search in OSS Index, by simply replacing “pypi” with the type for the other package manager and then re-running the search.
While it might seem odd to search the same
vulnerability database several times for the same product name and version number,
there is a good reason for doing this: There can be no assurance that a
vulnerability that applies to a particular product/version in one package
manager will also apply to the “same” product/version in a different package
manager. In other words, purl treats products with the same name and version
number as different products if they are found in different repositories.
[vii] This
is like an early type of AI called an “expert system”. These systems were literally
created by interviewing an expert in a certain process (e.g., operation of a
machine in a manufacturing plant) and codifying their advice into a set of
rules. A simulation of the process would then be run, governed by these rules;
the rules would be iteratively tweaked to improve the outcome of the process. After
the process was running smoothly in the simulation, the rules would then be
tested on the physical process itself.
The most important aspect of this procedure was that
any change in the rules could be audited. If a rule was changed but that didn’t
improve the process, the change would be backed out and a different change
would be tried.