Why have apparent memory bitflips in non-ECC memory not increased?Freezing DRAM for forensics (coldboot)How does SDRAM refresh interact with ECCWhy does flash memory have a lifespan?PIC32 does not have Memory Managment Unit, why would a high end microcontroller not have such a peripheralFlash memory data retention timeWhy use a buffer in/output for a RAM memory?Difference between a memory cell and a memory chip?Why does this RAM component have unpredictable behavior in Multisim?DDR3 Data Errors
How acceptable is an ellipsis "..." in formal mathematics?
What ways are there to bypass spell resistance?
What is the fastest way to move in Borderlands 3?
How can I remove rest of file from string for all files?
Shieldgate - Tied Village Clusters
Accounting for intervals not present within the chord ratio?
Should a grammatical article be a part of a web link anchor
Can something have more sugar per 100g than the percentage of sugar that's in it?
Is there a more efficient alternative to pull down resistors?
Sanitise a high score table
Can massive damage kill you while at 0 HP?
I am confused with the word order when putting a sentence into passé composé with reflexive verbs
one-liner vs script
Why didn't Trudy wear a breathing mask in Avatar?
How to find an internship in OR/Optimization?
Should I reveal productivity tricks to peers, or keep them to myself in order to be more productive than the others?
Reduction of carbamate with LAH
Is having your hand in your pocket during a presentation bad?
Is there such thing as plasma (from reentry) creating lift?
Why has Donald Trump's popularity remained so stable over a rather long period of time?
Is there a simple way to apply a function to the RHS of a substitution?
Had there been instances of national states banning harmful imports before the mid-19th C Opium Wars?
Relation between signal processing and control systems engineering?
Does the Creighton Method of Natural Family Planning have a failure rate of 3.2% or less?
Why have apparent memory bitflips in non-ECC memory not increased?
Freezing DRAM for forensics (coldboot)How does SDRAM refresh interact with ECCWhy does flash memory have a lifespan?PIC32 does not have Memory Managment Unit, why would a high end microcontroller not have such a peripheralFlash memory data retention timeWhy use a buffer in/output for a RAM memory?Difference between a memory cell and a memory chip?Why does this RAM component have unpredictable behavior in Multisim?DDR3 Data Errors
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty
margin-bottom:0;
$begingroup$
Back in the early 2000s I remember asking about why it was so important that servers use ECC memory. The prevailing wisdom at the time was that systems with lots of RAM would be, statistically, more likely to suffer bitflips. This makes sense - if each cell has a 10-20 probability of suffering a bitflip per second, then 109 cells have a 10-11 probability per second. The more cells you have, the higher the probability of a bitflip in any given time period.
Back then we would be looking at a ballpark of 128MB to 1GB of RAM. These days we regularly put 16GB or more in laptops and desktops, with workstations commonly having 64GB or more. For argument's sake, let's say we've increased total RAM amounts by two orders of magnitude. So we should see a hundred times or so more bitflips, on average, in any given system, assuming that nothing else changed.
The more I thought about it, though, the more I realised that the random bitflip rate should be much higher in newer systems:
- Lower operating voltages means lower distinction between a 0 and a 1.
- Lower gate charge means less overall energy required to flip a bit.
- More densely packed gates increases the likelihood of being affected by cosmic rays.
- Refresh times don't seem to have gone anywhere. DDR2 tRFC was 40-60 clocks, DDR3 tRFC was more like 90-130 clocks, and DDR4 tRFC is more like 200-450 clocks. When you divide by the internal memory clock rates to get a wall time for each refresh timing it doesn't really show much of a trend - it's effectively flat but with a higher margin either way as time goes on.
But, as far as I know, we're not seeing bitflips everywhere on non-ECC RAM, at least within the confines of our atmosphere.
So, what's the deal? Why aren't we seeing endless bitflips everywhere, at least 100x if not 10000x more frequently than two decades ago? Is ECC actually important in the context of growing RAM sizes, or do the stats not back it up? Or is there some other technology advance that is mitigating bitflip problems in non-ECC memory? I'm particularly interested in answers with authoritative references rather than speculation about error rates.
memory ram error error-correction ecc
$endgroup$
|
show 14 more comments
$begingroup$
Back in the early 2000s I remember asking about why it was so important that servers use ECC memory. The prevailing wisdom at the time was that systems with lots of RAM would be, statistically, more likely to suffer bitflips. This makes sense - if each cell has a 10-20 probability of suffering a bitflip per second, then 109 cells have a 10-11 probability per second. The more cells you have, the higher the probability of a bitflip in any given time period.
Back then we would be looking at a ballpark of 128MB to 1GB of RAM. These days we regularly put 16GB or more in laptops and desktops, with workstations commonly having 64GB or more. For argument's sake, let's say we've increased total RAM amounts by two orders of magnitude. So we should see a hundred times or so more bitflips, on average, in any given system, assuming that nothing else changed.
The more I thought about it, though, the more I realised that the random bitflip rate should be much higher in newer systems:
- Lower operating voltages means lower distinction between a 0 and a 1.
- Lower gate charge means less overall energy required to flip a bit.
- More densely packed gates increases the likelihood of being affected by cosmic rays.
- Refresh times don't seem to have gone anywhere. DDR2 tRFC was 40-60 clocks, DDR3 tRFC was more like 90-130 clocks, and DDR4 tRFC is more like 200-450 clocks. When you divide by the internal memory clock rates to get a wall time for each refresh timing it doesn't really show much of a trend - it's effectively flat but with a higher margin either way as time goes on.
But, as far as I know, we're not seeing bitflips everywhere on non-ECC RAM, at least within the confines of our atmosphere.
So, what's the deal? Why aren't we seeing endless bitflips everywhere, at least 100x if not 10000x more frequently than two decades ago? Is ECC actually important in the context of growing RAM sizes, or do the stats not back it up? Or is there some other technology advance that is mitigating bitflip problems in non-ECC memory? I'm particularly interested in answers with authoritative references rather than speculation about error rates.
memory ram error error-correction ecc
$endgroup$
2
$begingroup$
@SunnyskyguyEE75 That's irrelevant to the question and makes an assumption that the probabilistic risk of using non-ECC memory is high in the first place, which is the exact assumption that I'm challenging here. I understand probabilistic risk and impact (my career depends upon it); the question is about whether those risks are being overstated in the first place.
$endgroup$
– Polynomial
Apr 17 at 13:15
2
$begingroup$
@SunnyskyguyEE75 I'm not asking how to fix the problem, or how to model probabilistic MTBFs. Just interested in authoritative references as to why a 16GB non-ECC DIMM from 2019 seems to have the same apparent MTBF as a 128MB non-ECC DIMM from 2002, despite the conventional wisdom back then saying "if you have more than 1GB of RAM you should use ECC, because of bit flips" and there being two whole orders of magnitude between the sizes, plus all the other negative factors listed above. If you want to speculate, feel free, but that's not really what I'm looking for in an answer.
$endgroup$
– Polynomial
Apr 17 at 13:29
3
$begingroup$
To a great extent it comes down to the use case. The statistical chance of a bit flip (most commonly due to free neutrons) at sea level is really low, but non-zero, however the other part of that is that you may not notice because the bit that gets flipped may be in an area not currently being used. A study by Boeing on servers in Denver established that bit flips (SEUs in the trade) do indeed occur regularly. In avionics we are required to use ECC as the chance increases drastically with altitude.
$endgroup$
– Peter Smith
Apr 17 at 13:33
2
$begingroup$
@SunnyskyguyEE75 As I've said, I'm not looking to predict it. I work in infosec; I understand the difference between risk/probabilities and predicting actual discrete incidents. It's practically the same as radioactivity - you can model decays/sec, but you can't predict the individual decays with any certainty. I'm asking about differences between rates. "Ask Crucial" isn't a useful answer here.
$endgroup$
– Polynomial
Apr 17 at 13:36
3
$begingroup$
The only company I know of to do a long term study is Xilinx on the Rosetta project. They do publish soft error rates for all their parts. Note that internal geometries and layout (which is used to reduce the neutron cross section) play a large part in effective vulnerability.
$endgroup$
– Peter Smith
Apr 17 at 13:38
|
show 14 more comments
$begingroup$
Back in the early 2000s I remember asking about why it was so important that servers use ECC memory. The prevailing wisdom at the time was that systems with lots of RAM would be, statistically, more likely to suffer bitflips. This makes sense - if each cell has a 10-20 probability of suffering a bitflip per second, then 109 cells have a 10-11 probability per second. The more cells you have, the higher the probability of a bitflip in any given time period.
Back then we would be looking at a ballpark of 128MB to 1GB of RAM. These days we regularly put 16GB or more in laptops and desktops, with workstations commonly having 64GB or more. For argument's sake, let's say we've increased total RAM amounts by two orders of magnitude. So we should see a hundred times or so more bitflips, on average, in any given system, assuming that nothing else changed.
The more I thought about it, though, the more I realised that the random bitflip rate should be much higher in newer systems:
- Lower operating voltages means lower distinction between a 0 and a 1.
- Lower gate charge means less overall energy required to flip a bit.
- More densely packed gates increases the likelihood of being affected by cosmic rays.
- Refresh times don't seem to have gone anywhere. DDR2 tRFC was 40-60 clocks, DDR3 tRFC was more like 90-130 clocks, and DDR4 tRFC is more like 200-450 clocks. When you divide by the internal memory clock rates to get a wall time for each refresh timing it doesn't really show much of a trend - it's effectively flat but with a higher margin either way as time goes on.
But, as far as I know, we're not seeing bitflips everywhere on non-ECC RAM, at least within the confines of our atmosphere.
So, what's the deal? Why aren't we seeing endless bitflips everywhere, at least 100x if not 10000x more frequently than two decades ago? Is ECC actually important in the context of growing RAM sizes, or do the stats not back it up? Or is there some other technology advance that is mitigating bitflip problems in non-ECC memory? I'm particularly interested in answers with authoritative references rather than speculation about error rates.
memory ram error error-correction ecc
$endgroup$
Back in the early 2000s I remember asking about why it was so important that servers use ECC memory. The prevailing wisdom at the time was that systems with lots of RAM would be, statistically, more likely to suffer bitflips. This makes sense - if each cell has a 10-20 probability of suffering a bitflip per second, then 109 cells have a 10-11 probability per second. The more cells you have, the higher the probability of a bitflip in any given time period.
Back then we would be looking at a ballpark of 128MB to 1GB of RAM. These days we regularly put 16GB or more in laptops and desktops, with workstations commonly having 64GB or more. For argument's sake, let's say we've increased total RAM amounts by two orders of magnitude. So we should see a hundred times or so more bitflips, on average, in any given system, assuming that nothing else changed.
The more I thought about it, though, the more I realised that the random bitflip rate should be much higher in newer systems:
- Lower operating voltages means lower distinction between a 0 and a 1.
- Lower gate charge means less overall energy required to flip a bit.
- More densely packed gates increases the likelihood of being affected by cosmic rays.
- Refresh times don't seem to have gone anywhere. DDR2 tRFC was 40-60 clocks, DDR3 tRFC was more like 90-130 clocks, and DDR4 tRFC is more like 200-450 clocks. When you divide by the internal memory clock rates to get a wall time for each refresh timing it doesn't really show much of a trend - it's effectively flat but with a higher margin either way as time goes on.
But, as far as I know, we're not seeing bitflips everywhere on non-ECC RAM, at least within the confines of our atmosphere.
So, what's the deal? Why aren't we seeing endless bitflips everywhere, at least 100x if not 10000x more frequently than two decades ago? Is ECC actually important in the context of growing RAM sizes, or do the stats not back it up? Or is there some other technology advance that is mitigating bitflip problems in non-ECC memory? I'm particularly interested in answers with authoritative references rather than speculation about error rates.
memory ram error error-correction ecc
memory ram error error-correction ecc
edited Apr 17 at 13:46
Polynomial
asked Apr 17 at 13:03
PolynomialPolynomial
3,0404 gold badges31 silver badges59 bronze badges
3,0404 gold badges31 silver badges59 bronze badges
2
$begingroup$
@SunnyskyguyEE75 That's irrelevant to the question and makes an assumption that the probabilistic risk of using non-ECC memory is high in the first place, which is the exact assumption that I'm challenging here. I understand probabilistic risk and impact (my career depends upon it); the question is about whether those risks are being overstated in the first place.
$endgroup$
– Polynomial
Apr 17 at 13:15
2
$begingroup$
@SunnyskyguyEE75 I'm not asking how to fix the problem, or how to model probabilistic MTBFs. Just interested in authoritative references as to why a 16GB non-ECC DIMM from 2019 seems to have the same apparent MTBF as a 128MB non-ECC DIMM from 2002, despite the conventional wisdom back then saying "if you have more than 1GB of RAM you should use ECC, because of bit flips" and there being two whole orders of magnitude between the sizes, plus all the other negative factors listed above. If you want to speculate, feel free, but that's not really what I'm looking for in an answer.
$endgroup$
– Polynomial
Apr 17 at 13:29
3
$begingroup$
To a great extent it comes down to the use case. The statistical chance of a bit flip (most commonly due to free neutrons) at sea level is really low, but non-zero, however the other part of that is that you may not notice because the bit that gets flipped may be in an area not currently being used. A study by Boeing on servers in Denver established that bit flips (SEUs in the trade) do indeed occur regularly. In avionics we are required to use ECC as the chance increases drastically with altitude.
$endgroup$
– Peter Smith
Apr 17 at 13:33
2
$begingroup$
@SunnyskyguyEE75 As I've said, I'm not looking to predict it. I work in infosec; I understand the difference between risk/probabilities and predicting actual discrete incidents. It's practically the same as radioactivity - you can model decays/sec, but you can't predict the individual decays with any certainty. I'm asking about differences between rates. "Ask Crucial" isn't a useful answer here.
$endgroup$
– Polynomial
Apr 17 at 13:36
3
$begingroup$
The only company I know of to do a long term study is Xilinx on the Rosetta project. They do publish soft error rates for all their parts. Note that internal geometries and layout (which is used to reduce the neutron cross section) play a large part in effective vulnerability.
$endgroup$
– Peter Smith
Apr 17 at 13:38
|
show 14 more comments
2
$begingroup$
@SunnyskyguyEE75 That's irrelevant to the question and makes an assumption that the probabilistic risk of using non-ECC memory is high in the first place, which is the exact assumption that I'm challenging here. I understand probabilistic risk and impact (my career depends upon it); the question is about whether those risks are being overstated in the first place.
$endgroup$
– Polynomial
Apr 17 at 13:15
2
$begingroup$
@SunnyskyguyEE75 I'm not asking how to fix the problem, or how to model probabilistic MTBFs. Just interested in authoritative references as to why a 16GB non-ECC DIMM from 2019 seems to have the same apparent MTBF as a 128MB non-ECC DIMM from 2002, despite the conventional wisdom back then saying "if you have more than 1GB of RAM you should use ECC, because of bit flips" and there being two whole orders of magnitude between the sizes, plus all the other negative factors listed above. If you want to speculate, feel free, but that's not really what I'm looking for in an answer.
$endgroup$
– Polynomial
Apr 17 at 13:29
3
$begingroup$
To a great extent it comes down to the use case. The statistical chance of a bit flip (most commonly due to free neutrons) at sea level is really low, but non-zero, however the other part of that is that you may not notice because the bit that gets flipped may be in an area not currently being used. A study by Boeing on servers in Denver established that bit flips (SEUs in the trade) do indeed occur regularly. In avionics we are required to use ECC as the chance increases drastically with altitude.
$endgroup$
– Peter Smith
Apr 17 at 13:33
2
$begingroup$
@SunnyskyguyEE75 As I've said, I'm not looking to predict it. I work in infosec; I understand the difference between risk/probabilities and predicting actual discrete incidents. It's practically the same as radioactivity - you can model decays/sec, but you can't predict the individual decays with any certainty. I'm asking about differences between rates. "Ask Crucial" isn't a useful answer here.
$endgroup$
– Polynomial
Apr 17 at 13:36
3
$begingroup$
The only company I know of to do a long term study is Xilinx on the Rosetta project. They do publish soft error rates for all their parts. Note that internal geometries and layout (which is used to reduce the neutron cross section) play a large part in effective vulnerability.
$endgroup$
– Peter Smith
Apr 17 at 13:38
2
2
$begingroup$
@SunnyskyguyEE75 That's irrelevant to the question and makes an assumption that the probabilistic risk of using non-ECC memory is high in the first place, which is the exact assumption that I'm challenging here. I understand probabilistic risk and impact (my career depends upon it); the question is about whether those risks are being overstated in the first place.
$endgroup$
– Polynomial
Apr 17 at 13:15
$begingroup$
@SunnyskyguyEE75 That's irrelevant to the question and makes an assumption that the probabilistic risk of using non-ECC memory is high in the first place, which is the exact assumption that I'm challenging here. I understand probabilistic risk and impact (my career depends upon it); the question is about whether those risks are being overstated in the first place.
$endgroup$
– Polynomial
Apr 17 at 13:15
2
2
$begingroup$
@SunnyskyguyEE75 I'm not asking how to fix the problem, or how to model probabilistic MTBFs. Just interested in authoritative references as to why a 16GB non-ECC DIMM from 2019 seems to have the same apparent MTBF as a 128MB non-ECC DIMM from 2002, despite the conventional wisdom back then saying "if you have more than 1GB of RAM you should use ECC, because of bit flips" and there being two whole orders of magnitude between the sizes, plus all the other negative factors listed above. If you want to speculate, feel free, but that's not really what I'm looking for in an answer.
$endgroup$
– Polynomial
Apr 17 at 13:29
$begingroup$
@SunnyskyguyEE75 I'm not asking how to fix the problem, or how to model probabilistic MTBFs. Just interested in authoritative references as to why a 16GB non-ECC DIMM from 2019 seems to have the same apparent MTBF as a 128MB non-ECC DIMM from 2002, despite the conventional wisdom back then saying "if you have more than 1GB of RAM you should use ECC, because of bit flips" and there being two whole orders of magnitude between the sizes, plus all the other negative factors listed above. If you want to speculate, feel free, but that's not really what I'm looking for in an answer.
$endgroup$
– Polynomial
Apr 17 at 13:29
3
3
$begingroup$
To a great extent it comes down to the use case. The statistical chance of a bit flip (most commonly due to free neutrons) at sea level is really low, but non-zero, however the other part of that is that you may not notice because the bit that gets flipped may be in an area not currently being used. A study by Boeing on servers in Denver established that bit flips (SEUs in the trade) do indeed occur regularly. In avionics we are required to use ECC as the chance increases drastically with altitude.
$endgroup$
– Peter Smith
Apr 17 at 13:33
$begingroup$
To a great extent it comes down to the use case. The statistical chance of a bit flip (most commonly due to free neutrons) at sea level is really low, but non-zero, however the other part of that is that you may not notice because the bit that gets flipped may be in an area not currently being used. A study by Boeing on servers in Denver established that bit flips (SEUs in the trade) do indeed occur regularly. In avionics we are required to use ECC as the chance increases drastically with altitude.
$endgroup$
– Peter Smith
Apr 17 at 13:33
2
2
$begingroup$
@SunnyskyguyEE75 As I've said, I'm not looking to predict it. I work in infosec; I understand the difference between risk/probabilities and predicting actual discrete incidents. It's practically the same as radioactivity - you can model decays/sec, but you can't predict the individual decays with any certainty. I'm asking about differences between rates. "Ask Crucial" isn't a useful answer here.
$endgroup$
– Polynomial
Apr 17 at 13:36
$begingroup$
@SunnyskyguyEE75 As I've said, I'm not looking to predict it. I work in infosec; I understand the difference between risk/probabilities and predicting actual discrete incidents. It's practically the same as radioactivity - you can model decays/sec, but you can't predict the individual decays with any certainty. I'm asking about differences between rates. "Ask Crucial" isn't a useful answer here.
$endgroup$
– Polynomial
Apr 17 at 13:36
3
3
$begingroup$
The only company I know of to do a long term study is Xilinx on the Rosetta project. They do publish soft error rates for all their parts. Note that internal geometries and layout (which is used to reduce the neutron cross section) play a large part in effective vulnerability.
$endgroup$
– Peter Smith
Apr 17 at 13:38
$begingroup$
The only company I know of to do a long term study is Xilinx on the Rosetta project. They do publish soft error rates for all their parts. Note that internal geometries and layout (which is used to reduce the neutron cross section) play a large part in effective vulnerability.
$endgroup$
– Peter Smith
Apr 17 at 13:38
|
show 14 more comments
2 Answers
2
active
oldest
votes
$begingroup$
Single-event upsets (SEU) at sea level tend to be caused either by radioactive contaminants in the IC manufacturing materials (particularly the metals) generating alpha particles or by high-energy neutrons (caused by cosmic rays in the atmosphere) ionizing atoms in the silicon itself.
Over the years, manufacturers have greatly reduced the threat caused by radioactive contaminants. There are also proprietary approaches to cell layout that can help to mitigate the risk of SEU. All of this is probably going to be trade secrets and not public information.
And, no, I'm not going to do your literature search for you. However, I recommend that you go through the IEEE Transactions on Nuclear Science if you want references.
$endgroup$
$begingroup$
Are saying these are the only significant causes of correctable errors ? Or uncorrectable errors or both or just limited to one cause of the SEU
$endgroup$
– Sunnyskyguy EE75
Apr 17 at 15:04
$begingroup$
Building on this, for the cosmic ray induced SEU case, the probability of getting hit with a cosmic ray correlates to die area, not number of bits within that die, and in general, memory die area has actually been going down over the years.
$endgroup$
– Nate Strickland
Apr 17 at 16:56
add a comment
|
$begingroup$
The answer is, more work needs to be done, and they aren't sure:
The results show that the radiation susceptibility has actually
improved somewhat for devices that have advanced to the 0.13μm level,
which contradicts earlier predictions. This trend is encouraging, but
it may not necessarily continue for devices that are scaled below 0.1
μm. It is important to note that the recent computer modeling
calculations for SEE susceptibility of scaled devices predicts a large
increase in collected charge that is directly in conflict with test
results for heavy ions and neutron soft errors.
Source: The Effect of Device Scaling on
Single-Event Effects in Advance
CMOS Devices

Source: The Effect of Device Scaling on
Single-Event Effects in Advance
CMOS Devices
The actual mechanism for SEE's/SEU's has more to do with voltage and geometry than size. The one would think that smaller size and smaller charge per memory element would make memory elements easier to flip and cause errors, but the effect is small and is more related to voltage and geometry. Which is good for space applications that rely increasingly on commercial technology (like cubesats).
$endgroup$
add a comment
|
Your Answer
StackExchange.ifUsing("editor", function ()
return StackExchange.using("schematics", function ()
StackExchange.schematics.init();
);
, "cicuitlab");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "135"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/4.0/"u003ecc by-sa 4.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2felectronics.stackexchange.com%2fquestions%2f433058%2fwhy-have-apparent-memory-bitflips-in-non-ecc-memory-not-increased%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
Single-event upsets (SEU) at sea level tend to be caused either by radioactive contaminants in the IC manufacturing materials (particularly the metals) generating alpha particles or by high-energy neutrons (caused by cosmic rays in the atmosphere) ionizing atoms in the silicon itself.
Over the years, manufacturers have greatly reduced the threat caused by radioactive contaminants. There are also proprietary approaches to cell layout that can help to mitigate the risk of SEU. All of this is probably going to be trade secrets and not public information.
And, no, I'm not going to do your literature search for you. However, I recommend that you go through the IEEE Transactions on Nuclear Science if you want references.
$endgroup$
$begingroup$
Are saying these are the only significant causes of correctable errors ? Or uncorrectable errors or both or just limited to one cause of the SEU
$endgroup$
– Sunnyskyguy EE75
Apr 17 at 15:04
$begingroup$
Building on this, for the cosmic ray induced SEU case, the probability of getting hit with a cosmic ray correlates to die area, not number of bits within that die, and in general, memory die area has actually been going down over the years.
$endgroup$
– Nate Strickland
Apr 17 at 16:56
add a comment
|
$begingroup$
Single-event upsets (SEU) at sea level tend to be caused either by radioactive contaminants in the IC manufacturing materials (particularly the metals) generating alpha particles or by high-energy neutrons (caused by cosmic rays in the atmosphere) ionizing atoms in the silicon itself.
Over the years, manufacturers have greatly reduced the threat caused by radioactive contaminants. There are also proprietary approaches to cell layout that can help to mitigate the risk of SEU. All of this is probably going to be trade secrets and not public information.
And, no, I'm not going to do your literature search for you. However, I recommend that you go through the IEEE Transactions on Nuclear Science if you want references.
$endgroup$
$begingroup$
Are saying these are the only significant causes of correctable errors ? Or uncorrectable errors or both or just limited to one cause of the SEU
$endgroup$
– Sunnyskyguy EE75
Apr 17 at 15:04
$begingroup$
Building on this, for the cosmic ray induced SEU case, the probability of getting hit with a cosmic ray correlates to die area, not number of bits within that die, and in general, memory die area has actually been going down over the years.
$endgroup$
– Nate Strickland
Apr 17 at 16:56
add a comment
|
$begingroup$
Single-event upsets (SEU) at sea level tend to be caused either by radioactive contaminants in the IC manufacturing materials (particularly the metals) generating alpha particles or by high-energy neutrons (caused by cosmic rays in the atmosphere) ionizing atoms in the silicon itself.
Over the years, manufacturers have greatly reduced the threat caused by radioactive contaminants. There are also proprietary approaches to cell layout that can help to mitigate the risk of SEU. All of this is probably going to be trade secrets and not public information.
And, no, I'm not going to do your literature search for you. However, I recommend that you go through the IEEE Transactions on Nuclear Science if you want references.
$endgroup$
Single-event upsets (SEU) at sea level tend to be caused either by radioactive contaminants in the IC manufacturing materials (particularly the metals) generating alpha particles or by high-energy neutrons (caused by cosmic rays in the atmosphere) ionizing atoms in the silicon itself.
Over the years, manufacturers have greatly reduced the threat caused by radioactive contaminants. There are also proprietary approaches to cell layout that can help to mitigate the risk of SEU. All of this is probably going to be trade secrets and not public information.
And, no, I'm not going to do your literature search for you. However, I recommend that you go through the IEEE Transactions on Nuclear Science if you want references.
answered Apr 17 at 14:11
Elliot AldersonElliot Alderson
12.4k2 gold badges12 silver badges25 bronze badges
12.4k2 gold badges12 silver badges25 bronze badges
$begingroup$
Are saying these are the only significant causes of correctable errors ? Or uncorrectable errors or both or just limited to one cause of the SEU
$endgroup$
– Sunnyskyguy EE75
Apr 17 at 15:04
$begingroup$
Building on this, for the cosmic ray induced SEU case, the probability of getting hit with a cosmic ray correlates to die area, not number of bits within that die, and in general, memory die area has actually been going down over the years.
$endgroup$
– Nate Strickland
Apr 17 at 16:56
add a comment
|
$begingroup$
Are saying these are the only significant causes of correctable errors ? Or uncorrectable errors or both or just limited to one cause of the SEU
$endgroup$
– Sunnyskyguy EE75
Apr 17 at 15:04
$begingroup$
Building on this, for the cosmic ray induced SEU case, the probability of getting hit with a cosmic ray correlates to die area, not number of bits within that die, and in general, memory die area has actually been going down over the years.
$endgroup$
– Nate Strickland
Apr 17 at 16:56
$begingroup$
Are saying these are the only significant causes of correctable errors ? Or uncorrectable errors or both or just limited to one cause of the SEU
$endgroup$
– Sunnyskyguy EE75
Apr 17 at 15:04
$begingroup$
Are saying these are the only significant causes of correctable errors ? Or uncorrectable errors or both or just limited to one cause of the SEU
$endgroup$
– Sunnyskyguy EE75
Apr 17 at 15:04
$begingroup$
Building on this, for the cosmic ray induced SEU case, the probability of getting hit with a cosmic ray correlates to die area, not number of bits within that die, and in general, memory die area has actually been going down over the years.
$endgroup$
– Nate Strickland
Apr 17 at 16:56
$begingroup$
Building on this, for the cosmic ray induced SEU case, the probability of getting hit with a cosmic ray correlates to die area, not number of bits within that die, and in general, memory die area has actually been going down over the years.
$endgroup$
– Nate Strickland
Apr 17 at 16:56
add a comment
|
$begingroup$
The answer is, more work needs to be done, and they aren't sure:
The results show that the radiation susceptibility has actually
improved somewhat for devices that have advanced to the 0.13μm level,
which contradicts earlier predictions. This trend is encouraging, but
it may not necessarily continue for devices that are scaled below 0.1
μm. It is important to note that the recent computer modeling
calculations for SEE susceptibility of scaled devices predicts a large
increase in collected charge that is directly in conflict with test
results for heavy ions and neutron soft errors.
Source: The Effect of Device Scaling on
Single-Event Effects in Advance
CMOS Devices

Source: The Effect of Device Scaling on
Single-Event Effects in Advance
CMOS Devices
The actual mechanism for SEE's/SEU's has more to do with voltage and geometry than size. The one would think that smaller size and smaller charge per memory element would make memory elements easier to flip and cause errors, but the effect is small and is more related to voltage and geometry. Which is good for space applications that rely increasingly on commercial technology (like cubesats).
$endgroup$
add a comment
|
$begingroup$
The answer is, more work needs to be done, and they aren't sure:
The results show that the radiation susceptibility has actually
improved somewhat for devices that have advanced to the 0.13μm level,
which contradicts earlier predictions. This trend is encouraging, but
it may not necessarily continue for devices that are scaled below 0.1
μm. It is important to note that the recent computer modeling
calculations for SEE susceptibility of scaled devices predicts a large
increase in collected charge that is directly in conflict with test
results for heavy ions and neutron soft errors.
Source: The Effect of Device Scaling on
Single-Event Effects in Advance
CMOS Devices

Source: The Effect of Device Scaling on
Single-Event Effects in Advance
CMOS Devices
The actual mechanism for SEE's/SEU's has more to do with voltage and geometry than size. The one would think that smaller size and smaller charge per memory element would make memory elements easier to flip and cause errors, but the effect is small and is more related to voltage and geometry. Which is good for space applications that rely increasingly on commercial technology (like cubesats).
$endgroup$
add a comment
|
$begingroup$
The answer is, more work needs to be done, and they aren't sure:
The results show that the radiation susceptibility has actually
improved somewhat for devices that have advanced to the 0.13μm level,
which contradicts earlier predictions. This trend is encouraging, but
it may not necessarily continue for devices that are scaled below 0.1
μm. It is important to note that the recent computer modeling
calculations for SEE susceptibility of scaled devices predicts a large
increase in collected charge that is directly in conflict with test
results for heavy ions and neutron soft errors.
Source: The Effect of Device Scaling on
Single-Event Effects in Advance
CMOS Devices

Source: The Effect of Device Scaling on
Single-Event Effects in Advance
CMOS Devices
The actual mechanism for SEE's/SEU's has more to do with voltage and geometry than size. The one would think that smaller size and smaller charge per memory element would make memory elements easier to flip and cause errors, but the effect is small and is more related to voltage and geometry. Which is good for space applications that rely increasingly on commercial technology (like cubesats).
$endgroup$
The answer is, more work needs to be done, and they aren't sure:
The results show that the radiation susceptibility has actually
improved somewhat for devices that have advanced to the 0.13μm level,
which contradicts earlier predictions. This trend is encouraging, but
it may not necessarily continue for devices that are scaled below 0.1
μm. It is important to note that the recent computer modeling
calculations for SEE susceptibility of scaled devices predicts a large
increase in collected charge that is directly in conflict with test
results for heavy ions and neutron soft errors.
Source: The Effect of Device Scaling on
Single-Event Effects in Advance
CMOS Devices

Source: The Effect of Device Scaling on
Single-Event Effects in Advance
CMOS Devices
The actual mechanism for SEE's/SEU's has more to do with voltage and geometry than size. The one would think that smaller size and smaller charge per memory element would make memory elements easier to flip and cause errors, but the effect is small and is more related to voltage and geometry. Which is good for space applications that rely increasingly on commercial technology (like cubesats).
answered Apr 17 at 17:23
Voltage SpikeVoltage Spike
39.7k12 gold badges44 silver badges116 bronze badges
39.7k12 gold badges44 silver badges116 bronze badges
add a comment
|
add a comment
|
Thanks for contributing an answer to Electrical Engineering Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2felectronics.stackexchange.com%2fquestions%2f433058%2fwhy-have-apparent-memory-bitflips-in-non-ecc-memory-not-increased%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
2
$begingroup$
@SunnyskyguyEE75 That's irrelevant to the question and makes an assumption that the probabilistic risk of using non-ECC memory is high in the first place, which is the exact assumption that I'm challenging here. I understand probabilistic risk and impact (my career depends upon it); the question is about whether those risks are being overstated in the first place.
$endgroup$
– Polynomial
Apr 17 at 13:15
2
$begingroup$
@SunnyskyguyEE75 I'm not asking how to fix the problem, or how to model probabilistic MTBFs. Just interested in authoritative references as to why a 16GB non-ECC DIMM from 2019 seems to have the same apparent MTBF as a 128MB non-ECC DIMM from 2002, despite the conventional wisdom back then saying "if you have more than 1GB of RAM you should use ECC, because of bit flips" and there being two whole orders of magnitude between the sizes, plus all the other negative factors listed above. If you want to speculate, feel free, but that's not really what I'm looking for in an answer.
$endgroup$
– Polynomial
Apr 17 at 13:29
3
$begingroup$
To a great extent it comes down to the use case. The statistical chance of a bit flip (most commonly due to free neutrons) at sea level is really low, but non-zero, however the other part of that is that you may not notice because the bit that gets flipped may be in an area not currently being used. A study by Boeing on servers in Denver established that bit flips (SEUs in the trade) do indeed occur regularly. In avionics we are required to use ECC as the chance increases drastically with altitude.
$endgroup$
– Peter Smith
Apr 17 at 13:33
2
$begingroup$
@SunnyskyguyEE75 As I've said, I'm not looking to predict it. I work in infosec; I understand the difference between risk/probabilities and predicting actual discrete incidents. It's practically the same as radioactivity - you can model decays/sec, but you can't predict the individual decays with any certainty. I'm asking about differences between rates. "Ask Crucial" isn't a useful answer here.
$endgroup$
– Polynomial
Apr 17 at 13:36
3
$begingroup$
The only company I know of to do a long term study is Xilinx on the Rosetta project. They do publish soft error rates for all their parts. Note that internal geometries and layout (which is used to reduce the neutron cross section) play a large part in effective vulnerability.
$endgroup$
– Peter Smith
Apr 17 at 13:38