Re: Re: [CVEPRI] Increasing numbers and timeliness of candidates
Scott Lawler said:
>(b) An acceptable level of noise is the low sustained level of
>inaccuracy often dealing with some possible duplicates and level of
>abstraction challenges. How many "errors" a month is too many? I'm
>not sure this level is possible to determine...let alone define,
>track, and meet to everyone's expectations.
I'd estimate that currently, about 3-5 duplicates are "caught" in
refinement for every 100 candidates, before a candidate is assigned.
And that's with a process that's designed to catch duplicates before
the refinement phase (that way, a content team member doesn't waste
time refining an issue that someone else is working on).
One of the most common causes of duplicates is alternate spellings.
Consider the following submissions that we might receive from 3
different sources:
Submission A: "MyProduct vulnerability in CoolFeature"
Submission B: "My Product vulnerability in Cool Feature"
Submission C: "My Product vulnerability"
Submissions B and C match up, but they don't match A at all, except
for the "vulnerability" term, which is rather common.
So, one content team member does refinement on just group (A), and a
different member gets (B, C). For this case, B's short description
might do a better job of explaining the issue to the team member than
A or C does. So, one refiner might have an easier time of
understanding the issue well enough to apply content decisions. If
all 3 submissions are together, there is less effort.
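To illustrate, here is a minimal sketch (the `normalize` helper is a hypothetical stand-in, not our actual tooling) of why a naive text comparison separates A from B and C, while a normalized comparison groups all three:

```python
# Hypothetical sketch of the spelling problem described above:
# lowercasing and stripping whitespace makes "MyProduct" and
# "My Product" compare equal.
def normalize(text):
    """Collapse case and whitespace so alternate spellings match."""
    return "".join(text.lower().split())

submissions = {
    "A": "MyProduct vulnerability in CoolFeature",
    "B": "My Product vulnerability in Cool Feature",
    "C": "My Product vulnerability",
}

# Naive substring matching misses A entirely:
print("My Product" in submissions["A"])        # False

# After normalization, all three share the same product string:
print(all(normalize("My Product") in normalize(s)
          for s in submissions.values()))      # True
```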
The alternate spelling of "My Product" and "MyProduct" is key... and a
*very* common occurrence across different vulnerability databases.
These errors come from the original announcement of the vulnerability.
One information provider might catch the misspelling and fix it, while
others might use the original spelling. Sometimes, even the vendor
uses multiple spellings. (To facilitate search and lookup of CVE
names, we find the correct spelling and put the wrong spelling in the
keywords. While I'm on the topic, I suggest that vulnerability
databases consider this practice as well.)
OK, so we've now had team member 1 produce a "pre-candidate" for group
(A), and member 2 has produced a "pre-candidate" for group (B,C).
This produces a "refinement duplicate."
The references don't always help with matching, either. There may be
other keywords in the descriptions that help to match. So, we match A
against everything... then B against everything... then C against
everything. It's redundant, but matching goes much faster than
refinement does (and it requires less technical skill), so we reduce
the bottleneck of refinement. Due to some personnel transition
issues, in Fall 2001, we had a situation where some submissions were
not matched against other submissions, which effectively mimics what
would happen if we didn't match everything against everything else.
The result? A sharp increase in refinement duplicates.
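The everything-against-everything matching pass could be sketched like this (a simplified keyword-overlap heuristic; the real matching also uses references and is not this naive):

```python
from itertools import combinations

# Ignore words like "vulnerability" that appear in nearly every
# submission and therefore carry no matching signal.
STOP = {"vulnerability", "in", "a", "the", "of"}

def keywords(text):
    return set(text.lower().split()) - STOP

def possible_duplicate(a, b, threshold=2):
    """Flag a pair when the descriptions share enough rare keywords
    (a stand-in for the real matching heuristics)."""
    return len(keywords(a) & keywords(b)) >= threshold

submissions = [
    "MyProduct vulnerability in CoolFeature",    # A
    "My Product vulnerability in Cool Feature",  # B
    "My Product vulnerability",                  # C
]

# Match everything against everything: quadratic, but matching is
# much cheaper than refinement.
for x, y in combinations(submissions, 2):
    if possible_duplicate(x, y):
        print("possible duplicate:", repr(x), "<->", repr(y))
# Only the (B, C) pair is flagged; A's spelling hides the match,
# which is exactly how refinement duplicates arise.
```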
I catch refinement duplicates when I am editing everyone's
refinements, *before* I assign a candidate number. But, I don't
always remember everything I've done (and every team member, including
me, has refined an A group and then a B,C group without remembering
that they already did it!). So, while I catch a number of dupes
during editing, some of them still might slip by.
And this is how, for example, Mark Cox just found that
CAN-2001-1227 and CAN-2001-1278 are duplicates.
It's also possible that submission A only describes one bug, whereas B
describes 1 bug, and C describes 2 bugs. (Actually, this is pretty
typical.) So, abstraction errors can creep in, too; I usually catch
these during editing and clean them up as well.
Based on content decision statistics, 15% of all candidates are
affected by abstraction CDs. That means that up to 15% of all
reported issues could be given the wrong level of abstraction,
depending on the amount of information available.
My editing task also includes things like modifying descriptions to
better fit the CVE "style," ensuring that content decisions have been
applied correctly, making sure that the analysis section includes the
appropriate information (e.g. which line in a change log can be proven
to indicate vendor acknowledgement), etc.
There is one last chance to catch refinement duplicates, and that is
*after* the numbers are assigned, but *before* the candidates are
proposed to the Board (or at least published on the CVE web site).
This involves matching all the newly created candidates against each
other. Content team members will have placed alternate spellings in
the keywords, and duplicate candidates will share many of the same
references, and the references will have been CVE-normalized. I
usually skip this step due to time factors; otherwise the
CAN-2001-1227/CAN-2001-1278 dupe would have been caught easily. On
the few occasions I've had the time to do this, I've reliably caught
a few more duplicates.
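That final cross-matching pass is mechanically simple once references are CVE-normalized. A minimal sketch, with invented candidate IDs and reference strings (none of these correspond to real candidates):

```python
from itertools import combinations

# Hypothetical candidates with CVE-normalized reference sets; the IDs
# and references here are invented for illustration.
candidates = {
    "CAN-XXXX-0001": {"BUGTRAQ:20011001 Example subject line",
                      "CONFIRM:vendor-changelog"},
    "CAN-XXXX-0002": {"BUGTRAQ:20011001 Example subject line"},
    "CAN-XXXX-0003": {"CERT:CA-XXXX-99"},
}

def duplicate_pairs(cands):
    """Return candidate pairs whose normalized reference sets overlap;
    such pairs are likely refinement duplicates."""
    return [(a, b) for (a, refs_a), (b, refs_b)
            in combinations(cands.items(), 2)
            if refs_a & refs_b]

print(duplicate_pairs(candidates))
# Only the first two candidates share a reference, so only that pair
# is flagged for a human to review.
```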
In recent months, we have begun enhancing the submissions by
automatically extracting the references (where possible) and
normalizing them to the CVE-style format. For example, a URL to a
Bugtraq post is followed, and a CVE-style reference is made that
includes the date and subject line. Same thing for a URL to a vendor
advisory. The matching algorithm also matches on references. So,
these normalized references help avoid duplicates - and they also save
refiners time, because they don't have to convert URLs or "loose
text" to CVE-style references.
This approach cuts down on duplicates in the longer run, because no
matter how many ways someone can spell a product name, the vendor
advisory ID or original Bugtraq post is the same, and the submissions
are more likely to be matched together.
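The reference-normalization step might look something like this sketch; the URL patterns and output format are assumptions for illustration, not the actual CVE tooling:

```python
# Hypothetical normalizer: turn a raw URL into a CVE-style reference
# string so that two submissions citing the same post yield an
# identical reference, no matter how the product name is spelled.
def normalize_reference(url, date="", subject=""):
    """Build a CVE-style reference; in the real pipeline, the date and
    subject would be extracted by following the URL."""
    if "securityfocus" in url and date and subject:
        # A Bugtraq post: record its date and subject line.
        return f"BUGTRAQ:{date} {subject}"
    # Anything unrecognized is kept as a generic reference.
    return f"MISC:{url}"

r1 = normalize_reference("http://www.securityfocus.com/archive/1/12345",
                         date="20011105", subject="Example advisory")
print(r1)   # BUGTRAQ:20011105 Example advisory
```

Because the normalized reference is identical across submissions, the matching algorithm can pair them on the reference alone.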
If we modify our processes to skip some of these steps, then
refinement duplicates will not be caught as often. If people are
putting candidates into their databases and products sooner, then the
"bad" duplicate will likely stay in those databases longer, which
prevents users from linking between multiple sources for those
vulnerabilities. (I believe that there is good evidence that this
already happens.)
Note that I'm skipping all the difficulties in determining vendor
acknowledgement (which would be addressed in the long term by
responsible vulnerability disclosure on the part of vendors and
researchers), as well as the detailed types of questions that cause even
vendors to scratch their heads.
Will The Editorial Board Always Catch Duplicates While Voting?
No. That has already been demonstrated a few times, unfortunately.
But nobody's perfect, and it's possible that a Board member sees and
votes on one candidate but not the other, so it is to be expected that
sometimes a duplicate candidate becomes a duplicate entry.
What Can Be Done About It?
Getting more candidates into more vendor and researcher advisories,
sooner... which argues for more vendors and researchers using
candidates. Alternatively, getting candidate numbers into CVE's data
sources before
CVE uses those data sources, which immediately brings chickens and
eggs to mind.
Either approach would be facilitated by increasing the number of
CNAs, which requires "training" with respect to content decisions and
process changes (some of this training is already going on behind the
scenes).
And the final punchline: the best solution may be a very small,
closely coordinated group of individuals or organizations across the
industry, who are dedicated to producing candidate numbers quickly,
who are in the business of producing vulnerability data *very* fast
and on a *very* large scale, and who are willing and able to put in
some time daily since vulnerabilities never sleep, and who are able to
apply CVE's content decisions consistently regardless of how they do
things in their own databases, and who are experts in vulnerability
analysis across a variety of software or platforms, and who are
reasonably good at writing terse descriptions.