Signal 11 while compiling the kernel
This FAQ describes the possible causes of a problem that has been bothering lots of people lately: a Linux(*) kernel compile (or the compile of any other large package, for that matter) crashes with a "signal 11". The cause can be software or (most likely) hardware. Read on to find out more.
(*) Of course none of this is Linux-specific. If your hardware is flaky, Linux, Windows 3.1, FreeBSD, Windows NT and NextStep will all crash.
If you are not reading this at
http://www.BitWizard.nl/sig11/, that's where you can find the most recent version.
For those of you who prefer reading this in French, the French translation can be found at
http://www.linux-france.org/article/sig11-fr/.
For those of you who prefer reading this in Japanese, the Japanese translation can be found at
http://www.linux.or.jp/JF/JFdocs/GCC-SIG11-FAQ/.
Email me at
R.E.Wolff@BitWizard.nl if you find any spelling errors, have worthwhile additions, or have an "it also happened to me" story. (Note that I reject some suggested additions because I believe they are technical nonsense.) I would appreciate it if you put "sig11" or something like that in the subject. You can also email me about other subjects.
--------------------------------------------------------------------------------
The Sig11 FAQ
QUESTION
Signal 11, what does that mean?
ANSWER
Signal 11, officially known as "segmentation fault", means that the program accessed a memory location that was not assigned to it. That's usually a bug in the program. So if you're writing your own program, that's the most likely cause. However, this FAQ will concentrate on the possibilities besides that.
QUESTION
My (kernel) compile crashes with
gcc: Internal compiler error: program cc1 got fatal signal 11
What is wrong with the compiler? Which version of the compiler do I need? Is there something wrong with the kernel?
ANSWER
Most likely there is nothing wrong with your installation, your compiler or kernel. It very likely has something to do with your hardware. There are a variety of subsystems that can be at fault, and there are a variety of ways to fix them. Read on, and you'll find out more. There are two exceptions to this "rule". You could be running low on virtual memory, or you could be installing Red Hat 5.x, 6.x or 7.x. There is more about this near the end.
--------------------------------------------------------------------------------
QUESTION
OK, it may not be the software. How do I know for sure?
ANSWER
First let's make sure it is the hardware that is causing your trouble. When the "make" stops, simply type "make" again. If it compiles a few more files before stopping, it must be hardware that is causing the trouble. If it immediately stops again (i.e. scans a few directories with "nothing to be done for xxxx" before bombing at exactly the same place), try
dd if=/dev/HARD_DISK of=/dev/null bs=1024k count=MEGS
Change HARD_DISK to the name of your harddisk (e.g. hda or sda; "df ." will tell you which disk holds the current directory). Change MEGS to the number of megabytes of main memory that you have. This reads the first MEGS megabytes of your harddisk, which pushes the cached C source files and the gcc binary out of memory, so that they have to be reread from disk the next time you run the compile. Now type make again. If it still stops in the same place I'm starting to wonder if you're reading the right FAQ, as it is starting to look like a software problem after all.... Take a peek at the "what are the other possibilities" question..... If without this "dd" command the compiler keeps on stopping at the same place, but moves to another place after you use the "dd", you definitely have a disk->ram transfer problem.
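The procedure above can be sketched as a few small shell helpers. This is only an illustration under stated assumptions: the function names mem_megs, flush_cmd and diagnose are made up for this example, and mem_megs assumes that your kernel provides a "MemTotal:" line (in kB) in /proc/meminfo. flush_cmd only prints the dd command, so that you can check the device name before running it yourself.

```shell
#!/bin/sh
# Hypothetical helpers; none of these names come from the FAQ itself.

# Print main memory in whole megabytes, rounded up.
# $1: file in /proc/meminfo format (assumed to contain "MemTotal: <n> kB");
#     defaults to /proc/meminfo.
mem_megs() {
    awk '/^MemTotal:/ { print int(($2 + 1023) / 1024) }' "${1:-/proc/meminfo}"
}

# Print (do not run) the dd command that rereads MEGS megabytes from a disk.
# $1: disk name such as hda or sda.  $2: optional meminfo-format file.
flush_cmd() {
    echo "dd if=/dev/$1 of=/dev/null bs=1024k count=$(mem_megs "$2")"
}

# Apply the reasoning above to three observations, each the name of the
# file at which the compile stopped:
# $1: first run, $2: second run (no dd in between), $3: run after the dd.
diagnose() {
    if [ "$1" != "$2" ]; then
        echo "hardware"      # stops at a different place each time
    elif [ "$2" != "$3" ]; then
        echo "disk->ram"     # moved only after the cached data was reread
    else
        echo "software?"     # always the same place: suspect software
    fi
}
```

For example, "flush_cmd hda" prints the dd line for the first IDE disk, and "diagnose sched.c sched.c fork.c" prints "disk->ram". Note that MemTotal is usually a little below the amount of installed RAM, so round MEGS up to the installed size if in doubt.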
QUESTION
What does it really mean? Are you sure it's a hardware problem?
ANSWER
Well, the compiler accessed memory outside its memory range. If this happens on working hardware, it's a programming error inside the compiler. That's why it says "internal compiler error". However, when the hardware occasionally flips a bit, gcc uses so many pointers that it is likely to end up accessing something outside of its addressing range. (Random addresses are mostly outside your addressing range, as not very many people have a significant part of 4G as main memory... :-) It seems that nowadays everybody with "signal 11" problems gets directed to this page. If you're developing your own software or have software that hasn't been debugged quite enough, "signal 11" (or segmentation fault) is still a very strong hint that there is something wrong with the program. Only when a program like "gcc", which works for almost everybody else, crashes on a dataset (e.g. the Linux kernel) that has also been well tested does it become a hint that there is something wrong with your hardware. If some software component like a hardware driver in your system is broken, it could cause symptoms that are VERY close to those of a hardware failure. However, when a driver is faulty it is more likely to cause serious trouble inside the kernel than to just make the compiler crash.
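To get a feeling for the numbers, here is a toy model in shell arithmetic. It assumes a 32-bit address space in which the program may use only the addresses from 0 up to some size; real processes have a more scattered layout, so treat this as a rough sketch. The function names flip_bit and flips_outside are invented for this example, and the shell is assumed to do 64-bit arithmetic (as bash and dash do on current systems).

```shell
#!/bin/sh
# Toy model: the program may use addresses 0 .. size-1 of a 32-bit space.

# Print the address obtained by flipping one bit (0..31) of a pointer.
# $1: pointer value, $2: bit number.
flip_bit() {
    echo $(( $1 ^ (1 << $2) ))
}

# Count how many of the 32 possible single-bit flips of a pointer
# produce an address outside the range 0 .. size-1.
# $1: pointer value (decimal), $2: size of the valid range in bytes (decimal).
flips_outside() {
    bit=0
    outside=0
    while [ "$bit" -lt 32 ]; do
        if [ "$(flip_bit "$1" "$bit")" -ge "$2" ]; then
            outside=$((outside + 1))
        fi
        bit=$((bit + 1))
    done
    echo "$outside"
}
```

With a pointer at 8 MB inside a 16 MB range, "flips_outside 8388608 16777216" prints 8: a flip in any of the top eight address bits lands outside the range. A completely random 32-bit address does even worse, as 16 MB is only 1/256 of 4G.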
--------------------------------------------------------------------------------
QUESTION
OK, I may have a hardware problem. What is it?
ANSWER
If it happens to be the hardware it can be:
Main memory. Your main memory might be getting an occasional bit wrong. If this happens on the "writes", you won't see any parity errors. There are several ways to fix it:
The memory speed might be too slow. Increase the number of wait states in the BIOS.
This could be caused by the AMIBIOS's autoconfig option: it may only know about 486s running at up to 80 MHz, whereas you can currently buy 100 MHz versions. -- Pat V.
The memory speed might be too slow. Get faster DRAM SIMMs. For example current ASUS motherboards require 60 ns DRAM if you have a 100 or 133 MHz processor (take a look in your motherboard's manual). I've heard reports that 70 ns also works, but then reliability problems like random sig11's are among the possibilities.... (I wouldn't take the risk) -- Andrew Eskilsson (mpt95aes@pt.hk-r.se)
You might think that you can run your 100MHz SDRAMs at 100MHz. Wrong! Read
http://www.bitwizard.nl/sig11/sdram.html to see why I think this is the case. You need SDRAMs rated at least one speed grade faster than the speed you run them at.
There is a bad chip on one of the SIMMs. If you own more than 1 bank of memory you might be able to pull SIMMs and see if the problem goes away. Be careful about STATIC!!!
We handled a hard one here last week. It turned out that ALL 4 16Mb SIMMs were broken in that they dropped a bit around once per hour. This was sufficient to crash the machine in about a day, or crash a kernel compile in about an hour. A new set of SIMMs works perfectly. It took a long while to diagnose this one, because all 4 of the SIMMs were affected equally, so leaving half of the memory out didn't change things.
Mark Kettner (kettner@cat.et.tudelft.nl) reports that his system was capable of running my memory test 2300 times faultlessly, but then detected around 10 errors. It then continued detecting no faults for a few hundred runs again..... In his case running kernel compiles was a much more efficient way of checking the health of the system (in the most stable configuration the system could compile around 14 kernels before going berserk). His solution was to "trade in" the old memory for a so-called "memory upgrade". The shopkeeper then "tests" the old memory in their memory tester, which OKs the memory. He then got a good discount on the new memory :-).
It seems that some 30-72 pin converters can cause memory errors. (See how old this entry is? Who remembers 30pin SIMMs? However all these things hold perfectly for SIMM <-> DIMM converters, or socket370 <-> slot 1 converters) (It hasn't been proven whether the 4 SIMMS in the converter had gone bad, or if the SIMM converter was at fault. The SIMMS had been functioning perfectly for years before they were moved into the converter....) -- Naresh Sharma (n.sharma@is.twi.tudelft.nl). Paul Gortmaker (paul.gortmaker@anu.edu.au) adds that the SIMM converters should have at least 4 bypass capacitors to keep the power supply of the SIMMs clean.
If the refresh of the DRAM isn't functioning properly, the DRAMs will slowly lose their information. Some (486) motherboards stop refreshing correctly when you turn on "hidden refresh". There seems to be a program called "dram" around that can also mess up your refresh to cause sig11 problems. -- Hank Barta (hank@pswin.chi.il.us), Ron Tapia (tapia@nmia.com)
The number of wait states could be too low. Increase the number of wait states in the BIOS for a fix. The Intel Endeavour board doesn't allow you to increase the memory wait states. This can supposedly be fixed by flashing a MR BIOS into the motherboard. -- David Halls (david.halls@cl.cam.ac.uk)
Cache memory. Your cache memory might be getting an occasional bit wrong. Caches are usually not equipped with parity. You can diagnose that this is the case by turning off the cache in the BIOS. If the problem goes away it is probably the cache. There are several ways to fix it:
The cache memory speed might be too slow. Increase the number of wait states in the BIOS.
The cache memory speed might be too slow. Get faster SRAM chips.
There is a bad chip in your cache. It is unlikely that you can swap chips as easily as with SIMMs. Be careful about STATIC!!! -- Joseph Barone (barone@mntr02.psf.ge.com)
The cache might be set to "write back" while there is a bug in the write back implementation of your chipset. The motherboard where this happened was a "MV020 486VL3H" (with 20M RAM) -- Scott Brumbaugh (scottb@borris.beachnet.com) (Mail address doesn't work. Scott: Get back at me with a valid return address)
The motherboard may require a jumper to switch between Cache On A Stick and the old-fashioned dip chip cache. (JP16 on Rev 2.4 ASUS P/I-P55TP4XE motherboards)
Disk transfers. A block coming from disk might incur an occasional bit error.
If you have this problem, you are most likely to have to do the "dd" command to "move" the problem from one place to the next....
Some IDE harddisks cannot handle the "irq_unmasking" option. This may only show under load. And it could show as a sig11.
Do you have a kalok 31xx? Throw it in the garbage. (or sell it to a DOS user. Update: Haven't heard about kalok for years. They're probably bust. The drives also don't work with W95 by the way.)
SCSI? Termination? A short bus might still work (unreliably that is) with bad termination. A long bus might get errors anyway. Can you turn on parity on the host and the DISK?
The CPU itself. Some batches of processors contain a much higher percentage of "bad" parts than others. Some years ago: original Intel Pentium-120's. A few years ago: AMD K6/2-300's (1998, produced in weeks 34 through 39!). And recently: AMD K6/2-450's. Some people may decide that running at, say, 400MHz is acceptable to them; however, if this turns out to be the problem, you're entitled to a new processor. Go and exchange it where you bought it. (Forget about those P120's, it's not worth the trouble... ;-) -- Guillaume Cottenceau (gcottenc@ens.insa-rennes.fr).
The CPU itself. Some batches of K6 processors simply have a design bug. Read
http://www.multimania.com/poulot/k6bug.html and then make sure you get your K6 exchanged. -- Rongen (rongen@istar.ca).
Overclocking. Cyrix P-166 processors run at 133MHz, not at 166. This must be logical to the guys at Cyrix, but to nobody else. You're overclocking them if you run them at 166MHz.....
Overclocking. Some vendors (or private people) think it is possible to overclock some CPUs. Some of them may work, others don't. You might want to try turning off turbo (note that most Pentium motherboards no longer support a non-turbo mode) and see if the problem goes away. Compare the speed of your CPU (printed on it; carefully remove the fan if necessary) with what the motherboard jumpers or BIOS settings say.... It seems that even Intel may make mistakes in this area. I now have several reliable reports that official Pentiums would sig11 at their rated speed, but not at a lower speed. As for some speeds the motherboard is stressed HARDER at a slower processor speed (120MHz -> motherboard runs at 60MHz, 100MHz -> motherboard runs at 66MHz), I think it is unlikely that this has anything to do with the motherboard. Moreover a new 120MHz processor is now functioning correctly. -- Samuel Ramac (sramac@vnet.ibm.com). This is not unique to Intel or any of its competitors.
CPU temperature. A high speed processor might overheat without the correct heat sink. This can also be caused by a failing fan. (My personal '486 has a fan that takes a few minutes to get up to speed. It probably will never really FAIL because it's now decommissioned :-).) The CPU can become erratic if "pushed" by compiling a kernel. This problem becomes worse if you disable "HALT" on the LILO command line. Linux tries to power down the CPU by executing the "halt" instruction when the system is idle. This conserves power, and therefore the CPU temperature drops when the system is idle. You therefore might not notice this problem when simply editing, and it might only surface after hours of CPU intensive jobs when the ambient temp is high. If you have a Pentium with the Fdiv bug, it is advisable to trade it in at Intel. They will send you a new one that comes pre-configured with an official Intel-approved FAN. Also note that most normal glues are very bad thermal conductors. There is special thermal glue available that should be used when a fan needs to be glued to a CPU. -- Arno Griffioen (arno@ixe.net), -- W. Paul Mills (wpmills@midusa.net) -- Alan Wind (wind@imada.ou.dk)
Intel says that the allowable temperature ranges for the outside of your CPU are:
0 to +85 C: Intel486 SX, Intel486 DX, IntelDX2, IntelDX4 processor
0 to +95 C: IntelDX2, IntelDX4 OverDrive processors
0 to +80 C: 60 MHz Pentium processor
0 to +70 C: 66 to 166 MHz Pentium processor
For information on how to measure this and some confirmation of what I say here, see:
http://pentium.intel.com/procs/support/faqs/iarcfaq.htm (Especially questions Q5, Q6 and Q12. The document is getting slightly outdated, but it is still very accurate. It seems the questions move around a bit every now and then as well.)
CPU voltage. Some motherboards allow you to select the CPU voltage. Some motherboards badly document the jumper settings that manage this. It seems that a 5V processor might still work most of the time at 3.3 volts..... -- Karl Heyes (krheyes@comp.brad.ac.uk)
RAM voltage. It seems that vendors are preparing for 3.3V RAM now. Most memory is now 3.3V. (but be careful if you have a board capable of setting the RAM voltage: 3.3v RAM will break at 5V.....) (Having heard little about this, I think the switch must be automatic.)
Local bus overloading. At 25MHz you're allowed to have 3 VesaLocalBus (VLB) cards, at 33MHz only two, at 40MHz only one, and guess what: at 50MHz NONE! (i.e. you are allowed to run your system with a 50MHz local bus, but then you're not allowed to use any VLB cards). Some systems start acting flaky when you overload the VLB. Even when your VLB isn't overloaded (over the limits stated above), the system may lose a few nanoseconds of margin by adding an extra VLB card, so you might need to add a cache wait state or something after you've added a new VLB card.... -- Richard Postgate (postgate@cafe.net)
Power management. Some laptops (and nowadays also "green" pc's) have power management features. These might interfere with Linux. One feature might save a memory image to HD and restore the RAM when you press a key. This sounds like fun, but Linux device drivers don't expect that the hardware has been turned off between two accesses. Some may recover, but others not. Try turning it off, or enabling "APM support" in your kernel. -- Elizabeth Ayer (eca23@cam.ac.uk)
Dust buildup. Some dust might conduct a bit and create a weak short. It might increase capacitances somewhere, and degrade timing characteristics. It might impede thermal flow, and lead to overheating components. It might even short a jumper connection! I recommend opening up your computer every year or so and vacuuming the inside. Tip: Those cotton-on-a-stick thingies help with prodding the dust out of inaccessible spots... -- Craig Graham (c_graham@hinge.mistral.co.uk)
The CPU itself. Several people are reporting that they have found nothing to blame except the CPU. This could also have been an incompatibility between the CPU and the motherboard. A wave of reports concerning Intel CPUs has passed (Feb '97). A new wave of reports is coming in that are blaming Cyrix/IBM 6x86 CPUs. Although it could indeed be the CPU, it could also be that your motherboard is incompatible with your CPU. At least I've seen a motherboard manual mention that it isn't compatible with older 6x86's. My own experience is that these devices aren't bad at all, and on a kernel compile I benchmarked a P166+ to be equivalent with a P155 (1.3 times faster than a P120).
The Memory hole. Many modern motherboards allow you to use old ISA video cards with one or two megabytes of linear frame buffer. To achieve this, they have to map out the memory just below 16Mb. Nobody actually ever used this feature, but if you turn the memory hole (or LFB support in some BIOSes) on, your machine will certainly be flaky..... -- Paul Connolly (pconnolly@macdux.com.au)
The Microcode. Especially on SMP systems, the CPUs may need an upgrade. Since the Pentium division disaster, Intel has made its CPUs field-upgradable! The CPU can be bumped a few versions by a special instruction from the BIOS. These upgrades usually come with your BIOS, so make sure you're running the latest BIOS, especially if you have an SMP system. -- Jeffrey Friedl (Email withheld).
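Whichever of the fixes above you try, you will want to know whether it helped. As Mark Kettner's report in the main memory section shows, repeated kernel compiles can be a more sensitive test than a memory tester. A minimal sketch of such a loop follows; the function name count_clean_runs is invented for this example, and the build command is whatever your kernel version needs.

```shell
#!/bin/sh
# Run a command up to MAX times and print how many runs succeeded
# before the first failure (or MAX if none failed).
# $1: maximum number of runs; remaining arguments: the command to run.
count_clean_runs() {
    max=$1
    shift
    runs=0
    while [ "$runs" -lt "$max" ]; do
        "$@" >/dev/null 2>&1 || break   # stop at the first failing run
        runs=$((runs + 1))
    done
    echo "$runs"
}
```

In the kernel source directory you could, for example, run "count_clean_runs 20 sh -c 'make clean && make'". If the count before the first failure goes up after a change, the change probably helped; a machine that is really healthy should reach the maximum every time.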
--------------------------------------------------------------------------------
QUESTION
RAM timing problems? I fiddled with the BIOS settings more than a month ago. I've compiled numerous kernels in the meantime and nothing went wrong. It can't be the RAM timing. Right?
ANSWER
Wrong. Do you think that the RAM manufacturers have a machine that makes 60ns RAMs and another one that makes 70ns RAMs? Of course not! They make a bunch, and then test them. Some meet the specs for 60 ns, others don't. Those might be 61 ns if the manufacturer had to put a number to it. In that case it is quite likely that the RAM works in your computer when, for example, the temperature is below 40 degrees centigrade (chips become slower when the temperature rises. That's why some supercomputers need so much cooling).
However "the coming of summer" or a long compile job may push the temperature inside your computer over the "limit". -- Philippe Troin (ptroin@compass-da.com)
--------------------------------------------------------------------------------