nedalloc v1.05 15th June 2008:
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

by Niall Douglas (http://www.nedprod.com/programs/portable/nedmalloc/)

Enclosed is nedalloc, an alternative malloc implementation for multiple
threads without lock contention, based on dlmalloc v2.8.4. It is more
or less a newer implementation of ptmalloc2, the standard allocator in
Linux (which is based on dlmalloc v2.7.0), but it also contains a per-thread
cache for maximum CPU scalability.

It is licensed under the Boost Software License, which basically means
you can do anything you like with it. This does not apply to the malloc.c.h
file, which remains copyright to others.

It has been tested on win32 (x86), win64 (x64), Linux (x64), FreeBSD (x64)
and Apple Mac OS X (x86). It works very well on all of these and is very
significantly faster than the system allocator on each of them.

By dropping in this allocator as a replacement for your system allocator,
you can see real-world speed improvements of up to three times in normal
code!

To use:
-=-=-=-
Drop nedmalloc.h, nedmalloc.c and malloc.c.h into your project.
Configure using the instructions in nedmalloc.h. Run and enjoy.

To test, compile test.c. It will run a comparison between your system
allocator and nedalloc and tell you how much faster nedalloc is. It also
serves as an example of usage.

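As a rough illustration of the drop-in step, the sketch below routes a
program's allocations through nedmalloc() and nedfree() from nedmalloc.h
when a build flag is set, and through the system allocator otherwise. The
USE_NEDMALLOC flag, the app_malloc/app_free macros and copy_string() are
names invented for this example; nedmalloc.h remains the authority on the
real configuration options.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#ifdef USE_NEDMALLOC              /* hypothetical flag set by your build */
#include "nedmalloc.h"
#define app_malloc(n) nedmalloc(n)
#define app_free(p)   nedfree(p)
#else                             /* fall back to the system allocator */
#define app_malloc(n) malloc(n)
#define app_free(p)   free(p)
#endif

/* Illustrative helper: duplicate a string using whichever allocator was
   selected above. Returns NULL if the allocation fails. */
static char *copy_string(const char *s)
{
    size_t len;
    char *out;

    assert(s != NULL);
    len = strlen(s) + 1;          /* include the terminating NUL */
    out = (char *) app_malloc(len);
    if (out)
        memcpy(out, s, len);
    return out;
}
```

Every copy_string() result should be released with app_free(); memory
obtained from one allocator must not be handed to the other's free.
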
Notes:
-=-=-=
If you want the very latest version of this allocator, get it from the
TnFOX SVN repository at svn://svn.berlios.de/viewcvs/tnfox/trunk/src/nedmalloc

Because nedalloc allocates an mspace per thread, it can cause
severe bloating of memory usage under certain allocation patterns.
You can substantially reduce this wastage by setting MAXTHREADSINPOOL,
or the threads parameter to nedcreatepool(), to a fraction of the number of
threads which would normally be in a pool at once. This will reduce
bloating at the cost of an increase in lock contention. If the allocated size
is less than THREADCACHEMAX, locking is avoided 90-99% of the time, so
if most of your allocations are below this value, you can safely set
MAXTHREADSINPOOL to one.

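One way to apply the "fraction of the threads" advice at run time might
look like the sketch below. pool_thread_slots() and create_shared_pool()
are hypothetical helpers, USE_NEDMALLOC is a made-up build flag, and the
divisor of 4 is an arbitrary starting point rather than a recommendation.
Passing 0 as the capacity is assumed to request the allocator's default
initial size.

```c
#include <assert.h>

#ifdef USE_NEDMALLOC              /* hypothetical flag set by your build */
#include "nedmalloc.h"
#endif

/* Hypothetical helper: pick the threads argument for nedcreatepool() as
   1/divisor of the threads expected to share the pool, never below one. */
static int pool_thread_slots(int expected_threads, int divisor)
{
    int slots;

    assert(divisor > 0);
    slots = expected_threads / divisor;
    return slots < 1 ? 1 : slots;
}

#ifdef USE_NEDMALLOC
/* Create a pool with mspaces for only a quarter of the expected threads,
   trading some lock contention for lower memory bloat. */
static nedpool *create_shared_pool(int expected_threads)
{
    return nedcreatepool(0, pool_thread_slots(expected_threads, 4));
}
#endif
```

The compile-time equivalent is to define MAXTHREADSINPOOL from your build
system before nedmalloc.c is compiled.
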
You will suffer memory leakage unless you call neddisablethreadcache()
per pool for every thread which exits. This is because nedalloc cannot
portably know when a thread exits and thus when its thread cache can
be returned for use by other code. Don't forget pool zero, the system pool.

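A thread could run something like the following just before it exits.
This is only a sketch: release_thread_caches() is a hypothetical helper,
USE_NEDMALLOC is a made-up build flag, and the code assumes that a null
pool pointer selects pool zero, the system pool. Without the flag, a
stand-in that merely counts calls lets the sketch compile on its own.

```c
#include <assert.h>
#include <stddef.h>

#ifdef USE_NEDMALLOC              /* hypothetical flag set by your build */
#include "nedmalloc.h"
#else
/* Stand-ins so the sketch compiles without nedmalloc; they only count
   how often the cache would have been released. */
typedef struct nedpool_t nedpool;
static int cache_disable_calls = 0;
static void neddisablethreadcache(nedpool *p)
{
    (void) p;
    ++cache_disable_calls;
}
#endif

/* Hypothetical helper: release the calling thread's cache in every pool
   it has used, then in the system pool. */
static void release_thread_caches(nedpool **pools, size_t npools)
{
    size_t i;

    assert(pools != NULL || npools == 0);
    for (i = 0; i < npools; ++i)
        neddisablethreadcache(pools[i]);
    neddisablethreadcache(0);     /* assumed: null means the system pool */
}
```

The call has to be made on the exiting thread itself, as the last thing
its thread function does, since the cache being released belongs to the
calling thread.
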
For C++ type allocation patterns (where the same sizes of memory are
regularly allocated and deallocated as objects are created and destroyed),
the threadcache always benefits performance. If however your allocation
patterns are different, searching the threadcache may significantly slow
down your code - as a rule of thumb, if cache utilisation is below 80%
(see the source for neddisablethreadcache() for how to enable debug
printing in release mode) then you should disable the thread cache for
that thread. You can compile out the threadcache code by setting
THREADCACHEMAX to zero.

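The 80% rule of thumb can be written down as a small check. The two
counters here are hypothetical: they stand for whatever figures you read
off the debug printing, namely how many allocations searched the thread
cache and how many of those it satisfied. Nothing below is part of the
nedalloc API.

```c
#include <assert.h>

/* Hypothetical check: returns 1 when cache utilisation (hits / lookups)
   is below 80%, 0 otherwise. Integer arithmetic avoids rounding at the
   80% boundary; the products only overflow for astronomically large
   counts. */
static int should_disable_thread_cache(unsigned long long hits,
                                       unsigned long long lookups)
{
    assert(hits <= lookups);
    if (lookups == 0)
        return 0;                 /* no data yet, leave the cache on */
    return hits * 100ULL < lookups * 80ULL;
}
```

A thread for which this returns 1 would then call neddisablethreadcache()
on the pools it uses.
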
Speed comparisons:
-=-=-=-=-=-=-=-=-=
See Benchmarks.xls for details.

The enclosed test.c can do two things: it can be a torture test or a speed
test. The speed test is designed to be a representative synthetic
memory allocator test. It works by randomly mixing allocations with frees,
with half of the allocation sizes being a power of two less than
512 bytes (to mimic C++ stack instantiated objects) and the other half
being a simple random value less than 16KB.

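The size distribution described above could be generated roughly as
follows. This is not the code from test.c, only a sketch of the idea: it
assumes the small sizes are powers of two, and synthetic_alloc_size() is
a name made up for this example. Zero-byte requests are avoided for
simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical generator: about half the results are powers of two below
   512 bytes, the rest are uniform-ish random sizes below 16KB. */
static size_t synthetic_alloc_size(void)
{
    size_t size;

    if (rand() & 1)
        size = (size_t) 1 << (rand() % 9);      /* 1, 2, 4, ... 256 */
    else
        size = (size_t) (1 + rand() % 16383);   /* 1 .. 16383 */
    assert(size >= 1 && size < 16384);
    return size;
}
```

A speed test would call this once per allocation and randomly interleave
the resulting allocations with frees of earlier blocks.
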
The real world code results are from Tn's TestIO benchmark. This is a
heavily multithreaded and memory intensive benchmark with a lot of branching
and other stuff modern processors don't like so much. As you'll note, the
test doesn't show the benefits of the threadcache mostly due to the saturation
of the memory bus being the limiting factor.

ChangeLog:
-=-=-=-=-=
v1.05 15th June 2008:
 * { 1042 } Added error check for TLSSET() and TLSFREE() macros. Thanks to
Markus Elfring for reporting this.
 * { 1043 } Fixed a segfault when freeing memory allocated using
nedindependent_comalloc(). Thanks to Pavel Vozenilek for reporting this.

v1.04 14th July 2007:
 * Fixed a bug with the new optimised implementation that failed to lock
on a realloc under certain conditions.
 * Fixed lack of thread synchronisation in InitPool() causing pool corruption
 * Fixed a memory leak of thread cache contents on disabling. Thanks to Earl
Chew for reporting this.
 * Added a sanity check for freed blocks being valid.
 * Reworked test.c into being a torture test.
 * Fixed GCC assembler optimisation misspecification

v1.04alpha_svn915 7th October 2006:
 * Fixed failure to unlock thread cache list if allocating a new list failed.
Thanks to Dmitry Chichkov for reporting this. Further thanks to Aleksey Sanin.
 * Fixed realloc(0, <size>) segfaulting. Thanks to Dmitry Chichkov for
reporting this.
 * Made config defines #ifndef so they can be overridden by the build system.
Thanks to Aleksey Sanin for suggesting this.
 * Fixed deadlock in nedprealloc() due to unnecessary locking of preferred
thread mspace when mspace_realloc() always uses the original block's mspace
anyway. Thanks to Aleksey Sanin for reporting this.
 * Made some speed improvements by hacking mspace_malloc() to no longer lock
its mspace, thus allowing the recursive mutex implementation to be removed
with an associated speed increase. Thanks to Aleksey Sanin for suggesting this.
 * Fixed a bug where allocating mspaces overran its max limit. Thanks to
Aleksey Sanin for reporting this.

v1.03 10th July 2006:
 * Fixed memory corruption bug in threadcache code which only appeared with >4
threads and in heavy use of the threadcache.

v1.02 15th May 2006:
 * Integrated dlmalloc v2.8.4, fixing the win32 memory release problem and
improving performance still further. Speed is now up to twice the speed of v1.01
(average is 67% faster).
 * Fixed win32 critical section implementation. Thanks to Pavel Kuznetsov
for reporting this.
 * Wasn't locking mspace if all mspaces were locked. Thanks to Pavel Kuznetsov
for reporting this.
 * Added Apple Mac OS X support.

v1.01 24th February 2006:
 * Fixed multiprocessor scaling problems by removing sources of cache sloshing
 * Earl Chew <earl_chew <at> agilent <dot> com> sent patches for the following:
   1. size2binidx() wasn't working for default code path (non x86)
   2. Fixed failure to release mspace lock under certain circumstances which
      caused a deadlock

v1.00 1st January 2006:
 * First release