Friday, April 11, 2025

Databases, all the way down

 

I have been attending VulnCon 2025 remotely this week, although not all the sessions. Even though the first conference was last year, VulnCon has clearly found its niche as the premier gathering place for people interested in or involved with vulnerability management. The conference is well designed and well executed.

The sessions I’ve been attending are those that have to do with software naming in what I call the “CVE ecosystem”, but which most people think of as the National Vulnerability Database (NVD). If you have been reading my recent posts, you know that:

1.      Learning about a software vulnerability isn’t very helpful if you don’t know what products are affected by it; ideally, you want to be able to search on a product name in a vulnerability database and immediately be shown all the vulnerabilities that have recently been identified in that product.  Moreover, since CVE is by far the most widely cited vulnerability type and there are now over 280,000 CVEs in the official list, affected products need to be referred to using a machine-readable software identifier. The only identifier currently supported by CVE.org (the organization funded by DHS that creates and manages CVE Records) is CPE, which stands for Common Platform Enumeration.

2.      When a CVE Numbering Authority (CNA), working for CVE.org, produces a CVE Record to report a new software vulnerability, they do not usually include a CPE name(s) to refer to affected products listed in the text of the record. The reason for this is that the NVD[i] has always wanted to be in control of CPE creation. This didn’t previously cause a big problem, since until last year, the NVD almost always created a CPE for every affected product described in the text of a CVE Record; they did this within a few days of receiving the record from CVE.org.

3.      However, starting on February 12, 2024, the NVD drastically slowed their production of CPE names, for a reason that has never been clearly explained. This has produced an ever-growing backlog of CVE Records without a CPE name. Despite several promises that they would fix the problem by a certain date, the backlog has continued to grow. Today, the backlog stands at well over 40,000 CVE Records (although a well-known vulnerability researcher estimated in the VulnCon chat that the backlog is now 52,000 records). Of course, this is far more than 50% of the total new CVEs identified since February 2024. The NVD no longer even talks about eliminating the backlog for good. My guess is they would be happy just to stop it from growing, but even that doesn’t seem likely now.

4.      Why is it bad that so many CVE Records don’t contain CPE names? It’s bad because a CVE Record without a CPE name is invisible to an automated search of the NVD. If a user of Product ABC wants to learn what vulnerabilities (CVEs) are currently present in that product, they might enter “Product ABC” in the search bar of the NVD. The user should see every CPE name that contains that text string. The user can determine which of those CPEs matches the product they use; then they can search for CVEs that apply to that CPE.

5.      However, if there are no CPE names that contain the text string, the user will receive the message, “There are 0 matching records.” The user will receive this message even if there is a CVE Record that states in its text that Product ABC is affected by the vulnerability, as long as that record doesn’t include Product ABC’s CPE name. The lack of the CPE name in the record means that searching on a CPE name will not inform the user that their product is affected by the vulnerability described in that record.

6.     But there’s a worse problem than not learning about vulnerabilities that affect the product being searched for: The above message is the same one that the user will receive if the product in fact has no identified vulnerabilities. Human nature alone dictates that most users will interpret the message this way. That is, most people will believe the product they use has no vulnerabilities, when in fact it may have a lot of them.
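
The ambiguity described in points 4 through 6 can be shown with a toy model. This is only a sketch and not the NVD's actual search logic: the record class, the search_by_cpe function, the CVE IDs and the CPE names are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CveRecord:
    cve_id: str                 # made-up ID, not a real CVE
    description: str            # free text; may mention the product by name
    cpe_names: list = field(default_factory=list)  # empty if never "enriched"


def search_by_cpe(records, cpe_name):
    """Return the IDs of records that list exactly this CPE name."""
    return [r.cve_id for r in records if cpe_name in r.cpe_names]


# Hypothetical CPE names for two products.
ABC_CPE = "cpe:2.3:a:example_vendor:product_abc:1.2:*:*:*:*:*:*:*"
XYZ_CPE = "cpe:2.3:a:example_vendor:product_xyz:3.0:*:*:*:*:*:*:*"

records = [
    # Product ABC is named in the text, but no CPE name was ever added.
    CveRecord("CVE-EXAMPLE-0001", "Authentication bypass in Product ABC 1.2"),
]

abc_hits = search_by_cpe(records, ABC_CPE)  # vulnerable, but invisible
xyz_hits = search_by_cpe(records, XYZ_CPE)  # no known vulnerabilities

# Both searches come back empty, so the user cannot tell the cases apart.
print(abc_hits, xyz_hits)
```

In this model the only way to tell the two cases apart is to read the description text of every record, which is exactly what an automated search is supposed to avoid.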

In my opinion, everyone in the CVE ecosystem needs to assume that CPE will never be a reliable identifier, even though nobody is saying that CPE should go away. What’s Plan B? Plan B is purl, which has come from literally nowhere eight years ago to being one of the two or three most widely used software identifiers in the world. However, purl cannot currently be used in CVE Records, so people in the CVE ecosystem currently cannot benefit from using it.

This is why I’m pleased to announce that purl will soon (let’s say in 6-9 months) be available in the CVE ecosystem. I’ve been advocating for purl for more than two years; interest in it has clearly been growing, but the day when it would become an officially accepted part of the CVE ecosystem has always seemed far away. Now, I can say with confidence that CNAs will be able to identify vulnerable products in CVE Records – and end users will be able to search for them – using purl within a year, and perhaps less than that.

Purl was discussed in at least four different sessions at VulnCon, but perhaps the most interesting was a two-hour workshop led by Chris Coffin of MITRE, leader of the CVE Quality Working Group, and Pete Allor, Senior Director of Product Security at Red Hat (both of them are members of the CVE.org Board, which runs the CNA Program within DHS). When the idea for the workshop first came up early in the year – it was primarily the brainchild of Christopher Robinson, aka “CRob”, of the Linux Foundation - the point of the workshop was to have a kind of “face-off” between purl and CPE.

At that time, the question was whether there was enough support for purl in the CVE community for the CVE Board to seriously consider moving forward with it as a second possible software identifier along with CPE. The point of the workshop was to get a “sense of the room” on this subject.

However, I was surprised (as were others) to learn that in the past month or two, the CVE Program decided to at least start laying the groundwork for incorporating purl in the CVE Record Format. How did this change come about? While I have no specific knowledge of the reason, I attribute it in large part to what became clear in March: far from making progress on eliminating their backlog of CVE Records without CPE names, the NVD was allowing it to grow at a much more rapid pace. Indeed, at the end of March, I was told that the backlog had grown from 55% of CVE Records issued since February 12, 2024 – its size at the end of 2024 – to over 70%.

In other words, searching the NVD for new vulnerabilities applicable to a software product has increasingly become an exercise in futility: You will most likely just get a message saying, “There are 0 matching records.” If you want a lift to your day, you can believe that means your product has zero vulnerabilities and you have nothing to worry about. Or if you want to be realistic, you can say this more likely means that any CVE Record that mentions the product you are searching for in its text does not include a CPE name for the product. If you want to verify this for yourself, you can always read the text of each of the 40,000 new CVE Records added to the NVD since February 12, 2024.

The CVE Program intends to change the CVE Record Format (the format used by CNAs to create CVE Records) to enable CNAs to use purl to identify a vulnerable software product, not just CPE. You might ask why that is such a big deal. After all, if the NVD is struggling to create CPE identifiers, why won’t they also struggle to create purl identifiers?

The answer is that purl identifiers don’t need to be “created”. Today, purl is mainly used to identify open source software distributed in package managers and similar repositories (of course, this includes a huge percentage of open source software products, especially of software components found in SBOMs). A typical purl is: “pkg:pypi/django@1.11.1”. The values of the fields in this purl are:

“pkg” – This field does not currently have a use, but it will in the future. Currently, all purls start with these three letters.

“pypi” – This is called the purl “type”. The package manager is designated in the type. In this case, the package manager (or more correctly, the package index) is PyPI.

“django” – This is the product name in that package manager.

“1.11.1” – This is the version number (or “version string”) in that package manager.
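
The four fields listed above can be pulled apart with a few lines of code. This is a minimal sketch that handles only the simple form used in this example; the function name parse_simple_purl is my own, and the full purl specification also allows optional parts (a namespace, qualifiers and a subpath) that this sketch deliberately ignores.

```python
def parse_simple_purl(purl):
    """Split a purl of the simple form pkg:<type>/<name>@<version>.

    Hypothetical helper for illustration only; it does not handle the
    optional namespace, qualifiers or subpath that the purl spec allows.
    """
    scheme, sep, rest = purl.partition(":")
    if sep != ":" or scheme != "pkg":
        raise ValueError("a purl must start with 'pkg:'")

    purl_type, sep, name_and_version = rest.partition("/")
    if sep != "/" or not purl_type:
        raise ValueError("missing purl type")

    name, _, version = name_and_version.partition("@")
    if not name:
        raise ValueError("missing package name")

    return {
        "scheme": scheme,
        "type": purl_type,
        "name": name,
        "version": version,
    }


fields = parse_simple_purl("pkg:pypi/django@1.11.1")
print(fields)
```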

If you are a CNA creating a new CVE Record that reports a vulnerability found in django v1.11.1 as it exists in PyPI, you can easily create the purl using the values for those four fields. If you’re not sure about one of the fields (e.g., you’re not sure about the spelling of django), you can verify it by checking in PyPI. Similarly, if you’re a user of django and want to learn about current vulnerabilities found in that product/version, you can look at the product itself, or else verify the information in PyPI.

The most important feature of this process is that the purl for django 1.11.1 as found in PyPI will always be globally unique. There are some open source products, like OpenSSL, that exist in multiple package managers, so the name and version string might be the same for all those instances. However, the package manager will be different in each instance. This means every purl is guaranteed to be globally unique.
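
To make the uniqueness point concrete, here is a small sketch. The make_purl helper is hypothetical, and the two type values and the version string are illustrative only; they are not a claim about where OpenSSL is actually published or which versions exist.

```python
def make_purl(purl_type, name, version):
    """Assemble a simple purl from its type, name and version (sketch only)."""
    return f"pkg:{purl_type}/{name}@{version}"


# Same name and version string, two different (illustrative) package managers.
purl_a = make_purl("conan", "openssl", "3.0.0")
purl_b = make_purl("nuget", "openssl", "3.0.0")

# The type field alone is enough to keep the two identifiers distinct.
print(purl_a)
print(purl_b)
print(purl_a != purl_b)
```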

By contrast, CPE names include at least two fields that are inherently ambiguous: product name and vendor name. Everyone knows that products are renamed regularly, due to M&A as well as various marketing and rebranding campaigns. But even the company name is hardly unambiguous. A consultant who worked at Microsoft once asked people there what company they worked for; she received over 20 different answers. This is compounded by the fact that software identifiers are based on a single spelling of a name, so “Microsoft, Inc” is different from “Microsoft”, which is different from “Microsoft, Inc.” with a period, etc.
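
The spelling problem is easy to reproduce. The sketch below is a simplification: it uses a plain dictionary keyed on the vendor string to stand in for any exact-match lookup, and the CVE ID is made up.

```python
# The three spellings from the paragraph above.
spellings = ["Microsoft, Inc", "Microsoft", "Microsoft, Inc."]

# To a computer, these are three unrelated strings.
distinct_spellings = set(spellings)

# Hypothetical index in which a record was filed under one spelling only.
index = {"Microsoft": ["CVE-EXAMPLE-0002"]}


def lookup(vendor):
    """Exact-match lookup, as a machine-readable identifier requires."""
    return index.get(vendor, [])


results = {s: lookup(s) for s in spellings}

# Only the spelling that matches character for character finds the record.
for spelling, hits in results.items():
    print(repr(spelling), hits)
```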

The NVD mostly leaves it up to a staff member – usually a contractor – to decide what values to include in the product name and vendor name fields of a CPE name they are creating. It is likely that the only direction they give the contractor is to adhere as closely as possible to existing values in the “CPE Dictionary” (which isn’t a dictionary at all, but simply a list of every CPE ever created). Of course, the product and vendor names vary greatly in the “dictionary”, even when they probably refer to the “same” product or vendor. So, the CPE dictionary is a very weak reed to lean on.

In discussions about this problem (which is the infamous software “naming problem”, in case you didn’t realize that), someone always asks, “Why don’t we just build a database of all software products and/or all software vendors?” That database can have a canonical name for each product or vendor; every staff member creating a new CPE name will need to adhere as closely as possible to similar names that are located near it in the database.

That idea sounds attractive until you start thinking about it. Then you quickly realize:

1.      Creating, and even more so maintaining, a database like that would be fantastically expensive – many times the cost of maintaining the NVD itself. Remember, the database will include not just big- or medium-sized software companies, but one-person shops that ship a single product. These will have to be tracked all the time for name changes, acquisitions, etc.[ii]

2.      As my friend the consultant found out, there is no agreement on either product or vendor naming among employees of a large software company. Who will oversee decisions regarding canonical names? Since I’m sure there’s no employee at Microsoft that even knows every product they make (let alone can track all the changes in product names), it’s not likely one person, or even one department, can make that decision. The decision will have to be delegated. How will that be done, and what criteria will be provided for the people that make these decisions? Just developing training for these people – which will have to be constantly repeated, of course – will be a monumental task.

3.      I will point out one area of agreement that I’ve found in these discussions: The person who advocates for an approach like this will usually end up saying their department should oversee software naming, because they are the only department with the right perspective to make these decisions. This is expected behavior, since there’s probably no objective way to decide who should oversee software naming.

To summarize the above, trying to definitively fix CPE name creation will usually lead to requiring at least two separate databases: for software and vendor names, respectively. I don’t know of any other way that it would be possible to enforce a policy like, ‘Any software developer whose name begins with the word “Microsoft” will be called “Microsoft Corporation” (and not “Microsoft Corp.”, “Microsoft, Inc.”, etc.).’

Moreover, it’s likely those two databases will themselves require other databases. After all, if a company like Microsoft is going to designate certain people to oversee naming for certain types of software, there will need to be a database that lists each of those people, as well as the types of products over which they have authority. And that database might itself require another database, etc.

How does purl decide the “correct” name for a software product found in a package manager? It follows a simple rule: the name of the product in the package manager is presumably under the control of the operator of the package manager. Whatever name the operator decides on is the correct one for that package manager, although another package manager may decide to give the “same” product a different name. The operator can be counted on to maintain a “controlled namespace”, in which no product name/version string combination duplicates the name/version of another product in the same package manager.

That way, the name of a product distributed through PyPI or Maven Central will always be the same for anyone who wants to look at the package manager (or even read the “About…” section on the main page of a software product they use); no centralized database lookup is required. Two different people (say, the CNA that creates a CVE Record that includes a purl for Product ABC version 1.2 and the user who wants to search for vulnerabilities in that product/version) should always, barring a mistake, create the same purl.
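
Here is a sketch of that idea, under the assumption that both parties read the same name and version from the package manager. The build_purl helper, the published dictionary and the CVE ID are hypothetical; nothing here reflects how CVE Records will actually store purls.

```python
def build_purl(purl_type, name, version):
    """Assemble a simple purl from values read off the package manager."""
    return f"pkg:{purl_type}/{name}@{version}"


# Hypothetical store of records, keyed by purl.
published = {}

# The CNA builds a purl from what PyPI shows and files a (made-up) record.
cna_purl = build_purl("pypi", "django", "1.11.1")
published.setdefault(cna_purl, []).append("CVE-EXAMPLE-0003")

# Later, a user builds a purl independently from the same PyPI values.
user_purl = build_purl("pypi", "django", "1.11.1")

# No dictionary lookup or central naming authority is involved.
matches = published.get(user_purl, [])
print(user_purl, matches)
```

The two parties never coordinate with each other; they arrive at the same string only because both take their values from the same controlled namespace.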

Problem solved.

If you would like to comment on what you have read here, I would love to hear from you. Please email me at tom@tomalrich.com.

My book "Introduction to SBOM and VEX" is available in paperback and Kindle versions! For background on the book and the link to order it, see this post.


[i] The National Vulnerability Database is part of NIST, which is part of the Department of Commerce. The CVE.org organization, which used to be called MITRE and is still staffed by contractors from the MITRE Corporation, is funded by the Department of Homeland Security (DHS).

[ii] Steve Springett is advocating an idea called “common lifecycle enumeration”. This can be thought of as an online ledger of changes in names and versions of a software product.

Friday, April 4, 2025

It’s time to start discussing the Global Vulnerability Database

For a year and a half, I’ve been talking about what I call the Global Vulnerability Database (GVD) – which isn’t a single database at all, but rather a federation of databases behind a single intelligent front end. At first, I just brought it up as one of those things that will take many years to come to fruition, but which is nice to think about, anyway.

That changed in February 2024, when the National Vulnerability Database (NVD) greatly slowed down performance of their most important function: adding machine-readable software identifiers called CPE names to new CVE records. Like a lot of people, I initially believed the NVD’s assertions that they were getting back on track and would soon resolve their backlog of over 30,000 CVE records that were “unenriched” – i.e., that did not include a CPE name(s) for the vulnerable product(s). This deficiency renders those records invisible to automated searches (such as from the NVD’s command line).

However, those assertions have all fallen by the wayside. On March 19, the NVD announced, “…we are working to increase efficiency by improving our internal processes, and we are exploring the use of machine learning to automate certain processing tasks.” In other words, they’re no longer even pretending they’ll eliminate their backlog, or even stop its growth, anytime soon.

Meanwhile, the percentage of CVE records the NVD is enriching[i] in a timely fashion has probably fallen to around 25% from 45%, where it was at the end of 2024. Moreover, this week I started to hear complaints about “503” (“service unavailable”) errors when people tried to reach the NVD (I experienced six of those in one day, vs. only one success). It’s no exaggeration to say that the NVD may be falling apart in real time.

There are two problems here, with separate causes. The problem with the lack of CPE names is in some way due to cuts in funding by another agency. NIST (the NVD’s parent organization) checked between the couch cushions and found them some extra cash last year, yet if anything the problem has gotten worse since they did that. The solution to the CPE problem (although it will take a couple of years before it’s fully in place) is to make it possible for CVE records to include purl identifiers, not just CPEs. That solution doesn’t depend on the NVD, but rather on CVE.org. I trust that organization much more than I do the NVD, although I’m worried about what might happen to them as well, given the current political climate.

However, the 503 errors aren’t primarily due to funding. They’re due to the NVD having a decrepit infrastructure with at least some non-redundant systems, as well as programs written in an old language that even I – old as I am – had never heard of. The only good solution to this problem is to bring up a huge dumpster to the building and gently deposit all their current servers, etc. in said dumpster. The NVD doesn’t have to worry about backing up their data, since the entire NVD can be downloaded in about ten minutes (I’m not kidding); thousands of complete backups of the NVD are made every day.

The only solution to the NVD’s problems is to replace it. But simply modernizing the hardware and software isn’t enough. Instead, we (meaning the worldwide software security industry) need to get together to decide what objectives we need to achieve in replacing the NVD. Here are my ideas on that topic:

1.      Access to the GVD needs to be free, at least for casual users. Users that regularly download large amounts of data using an API might need to be charged, although doing that could possibly cause more problems than it would solve.

2.      The GVD must be truly global and not under the control of any one government. The budgets and priorities of government agencies often fluctuate due to political decisions, with the needs of their users - their real “customers” – taking a distant second place.

3.      The GVD should be run by a global nonprofit set up something like the Internet Assigned Numbers Authority. IANA runs DNS and assigns IP addresses worldwide. It performs these tasks smoothly and efficiently. The fact that most of us use DNS hundreds (or even thousands) of times per day without even thinking about it shows this model can work well.

4.      Support for the GVD should come from both governments and private for-profit and nonprofit organizations. Governments that want to participate (and are not considered to pose a security threat to the database or its users) might pay an assessment based on GDP, population, etc.

5.      The GVD should be as comprehensive as possible. That is, it should include all major types of vulnerabilities (CVE, OSV, GHSA, etc.), as well as all major software identifiers (mainly CPE and purl). Achieving this goal will require creation of a federation of individual databases that already exist.

6.      This federation will be enabled by an “intelligent front end”. This will analyze all requests and interpret them for one or more databases. For example, a request for vulnerabilities that affect an open source product designated with a purl identifier could be routed to open source databases that support purl, like OSV, OSS Index, GitHub Security Advisories, etc. The responses might refer to different types of vulnerabilities (e.g., CVE, OSV, and ICSA), but they would all be based on purl.

7.      The individual databases will continue to be accessible directly – i.e., not going through the intelligent front end.

8.      There will need to be some mechanism by which vulnerability databases that are currently “for charge” will be able to charge for their services, perhaps on a per-transaction basis. Of course, the user will need to be warned when their search is about to be routed to a for-charge database.

9.      Responses will usually be routed back to the user without change, since it is usually impossible to “harmonize” different types of vulnerabilities. The exception to this rule is cases in which the person reporting the vulnerability (often a CVE Numbering Authority) has assigned it two identifiers, e.g. a CVE and an ICSA.

10.   The NVD does not need to go away, since it is the primary “custodian” of CVE records that include CPE identifiers for vulnerable products. Even when the NVD supports purl, many (and initially most) CNAs will continue to include CPE names in their new CVE records.

11.   CVE.org will continue to operate as a separate organization, since CVE records are widely used worldwide, not just in the NVD. However, it would be better to move CVE.org to an international organization, perhaps another IANA division that is separate from the GVD. To give CNAs more incentive to include items like CVSS scores and purl/CPE identifiers in their CVE records, it would be better if the CNAs could be paid in some way.
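
The routing idea in point 6 might look something like the sketch below. This is only one possible design: the ROUTES table and the route_request function are my assumptions, and the database names are simply the ones mentioned in this post. A real front end would also have to handle the for-charge databases in point 8.

```python
# Hypothetical routing table: identifier scheme -> databases that support it.
ROUTES = {
    "pkg": ["OSV", "OSS Index", "GitHub Security Advisories"],  # purl
    "cpe": ["NVD"],                                             # CPE
}


def route_request(identifier):
    """Return the databases a query for this identifier could be sent to.

    The scheme is whatever precedes the first colon, e.g. 'pkg' or 'cpe'.
    Unknown schemes are routed nowhere.
    """
    scheme = identifier.split(":", 1)[0].lower()
    return ROUTES.get(scheme, [])


purl_targets = route_request("pkg:pypi/django@1.11.1")
cpe_targets = route_request(
    "cpe:2.3:a:example_vendor:product_abc:1.2:*:*:*:*:*:*:*"
)
unknown_targets = route_request("Product ABC")

print(purl_targets)
print(cpe_targets)
print(unknown_targets)
```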

We might start having discussions of this in the bi-weekly SBOM Forum meetings, including today at 1 AM Eastern Time. If you would like to join us, please send me an email. 



[i] When looking at the NVD’s statistics, it’s important to keep in mind that they have changed their definition of “enrichment” of a CVE record. It used to include adding a CPE name, as well as other items like CVSS score. However, they seem to have conveniently dropped the requirement for a CPE name to be added, for it to be enrichment. While it’s good to have a CVSS score as well, CPE (or hopefully purl in the not-so-distant future) is by far the most important element that needs to be added to a CVE record.