Monday, February 15, 2010

STL Strings and reference counting ( again ! )

In the last post, I mentioned that I was (un)fortunate enough to get bitten by some trickiness that some folks put into their stl implementations :). Other than the common data structures and iterators, we (for some definition of "we" which mostly includes my team in my company ) hardly use any stl features. And the one data structure that is most ubiquitously used - well, because I can't imagine programming without it - is "The fun begins when stl implementations begin implementing strings using reference counting. I wrote about some nice (or not so nice) stuff happening related to reference counting some time back, though I guess some may argue that it may not be a common scenario. This post is going to speak about a more common scenario though.

It all starts when one plans to use a reference counted string in a multi-threaded application. The STL implementation has to guarantee that the reference counting is thread safe. The part of reference counting that is not thread safe by default is incrementing/decrementing the reference counters. Strings shared across threads can cause enough problems, and this is well documented and researched in many places. This document goes through how this objective is achieved in some STL implementations found across platforms we use.

The simplest way in which one can guard the reference count is to guard it using a mutex. And this is exactly what the Roguewave implementation in HP - UX ( the version shipped with HP UX is 2.1 ) does when the application is compiled with -mt option. This is the one of the best remedies to kill the performance of an application. As one of our profiling results show, one third of the time was spent in mutexes, a large portion of which was for these mutexes which were guarding the reference counts of strings ! And the worst part is that hardly any strings were being shared across threads - so it was a pure waste.

We had half a mind to actually write our own string classes from scratch with char*. But more specifically, what we were looking for was effectively something as mentioned in this blog post + the comments on that post : a way to enable/disable thread safety on the string class depending on the usage.

Implementations

* HP - UX ( Roguewave 2.1 )

** Disable locks on reference counts}
Roguewave 2.1 does provide an option to disable locks on reference counts altogether. To achieve that, one has to compile the application with the following pre-processor definition -
__HPACC_SKIP_REFCNT_MUTEX. Now that is sort of ok - but not exactly what we were looking for as we do want to control thread safety in certain cases.

Update: It seems the roguewave 2.1 always uses atomic reference counts internally (on itanium) which is wrapped around by a mutex call. So compiling with the above flag will provide atomic reference counts provided by itanium's processor directives ! This should suffice in most cases, I guess :)

** Disable Reference Counts
Roguewave 2.1 also provides an option to disable reference counting altogether via the preprocessor definition - _RWSTD_NO_STRING_REF_COUNT . Now this one seems pretty interesting as one can always wrap a reference counted pointer implementation around it !

** Solaris ( StlPort )
On the Solaris machines that we use, the STL implementation used is STLPort (version 4). It seems there is no reference counted implementation in this version (at least), so no worries about thread safety.

** AIX
The STL implementation used on AIX is the one written by PJ Plauger ( Dinkumware ). It is shipped with customizations ( so says this page) by both IBM and MS ). There are many known issues with the string class shipped with Microsoft Visual C++ 6.0 at least ( documented pretty comprehensively with workaround et all here ).

On the AIX implementation that we use - the reference counting has been disabled altogether using preprocessor tricks :)

** Linux ( On Intel processors )
The STL implementation used is the one provided by GNU. Reference counting is enabled and is provided by means atomic builtins provided by gcc, which internally use the atomic functions provided by Intel

Atomic Builtins
Now, it does seem that atomic increments/decrements will be faster than the standard mutex drama that other solutions provide. These functions are normally implemented using assembly directly, and all platforms ( HP , AIX , Solaris ) do provide some form of it.

I didn't get much performance statistics on the internet from those who have used these operations. The only one I could find related to stl is the one provided by SUN, but the fact that atomic operations facilitated by the processor are faster in some cases and slower in some other doesn't sound nice. One other article I found, which had a pretty nifty way to implement atomic functions by hand using PowerPC aseembly had pretty good performance figures.

Edit: A comprehensive, but bit dated, performance anaylsis of the string class with various implementations (plain alloc, fast alloc, atomic increments, critical sections etc ) can be found here

Solution ?
So, I don't know what is the best solution :). As most of the implementations that we use do not have reference counting at all ( either naturally - in case of Solaris/AIX, or by force on Roguewave STL ), a good option seems to be wrapping the stl string class under hand made reference counted pointer implementations - thread safe and non-thread safe ones - the thread safe ones with atomic builtins perhaps.

To neutralize any changes in the implementation on multiple platforms, I guess we should be using one implementation on all platforms ( STLPort seems very attractive for this ) - but some things are not in my hand :)

Update: http://www.gotw.ca/gotw/045.htm provides a very good comparative analysis of different approaches to implement a string class using actual numbers !

Saturday, December 26, 2009

STL, allocators, etc

Every time I find a link even remotely related to the internals of STL, I get interested in it. One of the reasons is because I've been bit very badly because of its various nitty gritties. There were some very hard to find memory "leaks" first, some weird performance issues and some interesting side effects like the one in the previous blog post.

The performance issues were encountered in a module which was using STL very heavily ( ok, STL was not the only reason for the performance hit - but when the execution time of a process comes down from 6 hours to 2.5 hours just by changing a certain configuration for STL, annoyed me a lot). The only thing i managed to write about it was this one line. Maybe I'll write more about it some other day, just so that I never forget about it.

The memory "leaks" fiasco was very interesting too. The memory usage of some process ( heap size ) was continuously increasing and no memory leak detector could find any leak. There was hardly any explicit pointer handling in the suspect code. Infact it was ridden with all value types with heavy usage of some STL containers. It's still a perfect whodunnit, as no one knows the answer for sure. I suspected memory fragmentation to start with, but then no one thought it made sense, or whether such a thing can happen ( whatever that means :) ). Finally, the case was closed prematurely by someone by citing this page from sgi.

Just recently, I came across a link, which explains in detail how EA Sports wrote their own implementation of the STL - EASTL. I didn't understand the full article as such, but there were some very interesting statements and references in it


...the std allocator design and its use by std containers make it hard to work with and leads to suboptimal performance and code bloat...


...the use of virtual functions is avoided as well, given the potential cache miss they may cause...


...The compiler optimizes for if statements to evaluate as true, so write your code to follow this when possible...


The third point is also emphasized in another article I read recently about the pitfalls of object oriented programming

In addition, it was also interesting to find an article on custom allocators by someone from outside the gaming industry. stackoverflow usually has interesting discussions on almost all topics. There were lots of discussions about custom allocators , one of which mentioning that it's a premature optimization for most things other than game development/embedded devices development. But then there were also discussions whether all the so-called-features of C++ are used in game development at all :).

After having been subjected to all those articles and having understood it only moderately, I thought I'll try my hand at writing an allocator just to check out some simple behaviour by printing some debug messages.

The following program just tracks the memory allocation/deallocation pattern on inserting elements into a vector. It's pretty interesting to see the pattern when the vector doubles its capacity to ensure all the elements are continuous.


#include <iostream>
#include <memory>
#include <cstdlib>
#include <vector>

namespace std {

template <>
class allocator<int>
{

public:
typedef size_t size_type;
typedef ptrdiff_t difference_type;
typedef int* pointer;
typedef const int* const_pointer;
typedef int& reference;
typedef const int& const_reference;
typedef int value_type;

allocator<int>() {}
allocator<int>(const allocator<int>&) {}

template <class U>
struct rebind { typedef allocator<U> other; };

pointer allocate(size_type n, const void * = 0) {
int* t = (int*) malloc(n * sizeof(int));
std::cout<< "Used allocator<int> to allocate "<<n<<" objects at address "
<< t << " (+)" << std::endl;
return t;
}

void deallocate(void* p, size_type n) {
if (p) {
free(p);
std::cout<< "Used allocator<int> to deallocate "<<n<<" objects at address "
<< p << " (-)" <<std::endl;
}
}

pointer address(reference x) const { return &x; }
const_pointer address(const_reference x) const { return &x; }

allocator<int>& operator=(const allocator<int>&) { return *this; }

void construct(pointer p, const int& val)
{
cout<<"Constructing object at "<<p<<endl;
new ((int*) p) int(val);
}

size_type max_size() const { return size_t(-1); }

};

}

int main()
{

std::vector<int> abcd;

for( int i = 1; i <= 10 ; i++ ) {
std::cout<<"Inserting '"<<i<<"' into the vector "<<std::endl;
abcd.push_back(i);
std::cout<<"-----------"<<std::endl;
}

return 0;
}


which printed the following on my 32 bit linux machine


Inserting '1' into the vector
Used allocator<int> to allocate 1 objects at address 0x804c008 (+)
Constructing object at 0x804c008
-----------
Inserting '2' into the vector
Used allocator<int> to allocate 2 objects at address 0x804c018 (+)
Constructing object at 0x804c01c
Used allocator<int> to deallocate 1 objects at address 0x804c008 (-)
-----------
Inserting '3' into the vector
Used allocator<int> to allocate 4 objects at address 0x804c028 (+)
Constructing object at 0x804c030
Used allocator<int> to deallocate 2 objects at address 0x804c018 (-)
-----------
Inserting '4' into the vector
Constructing object at 0x804c034
-----------
Inserting '5' into the vector
Used allocator<int> to allocate 8 objects at address 0x804c040 (+)
Constructing object at 0x804c050
Used allocator<int> to deallocate 4 objects at address 0x804c028 (-)
-----------
Inserting '6' into the vector
Constructing object at 0x804c054
-----------
Inserting '7' into the vector
Constructing object at 0x804c058
-----------
Inserting '8' into the vector
Constructing object at 0x804c05c
-----------
Inserting '9' into the vector
Used allocator<int> to allocate 16 objects at address 0x804c068 (+)
Constructing object at 0x804c088
Used allocator<int> to deallocate 8 objects at address 0x804c040 (-)
-----------
Inserting '10' into the vector
Constructing object at 0x804c08c
-----------
Used allocator<int> to deallocate 16 objects at address 0x804c068 (-)

Monday, December 21, 2009

The joys of reference counting

What is the output of the below program ?


#include <iostream>
#include <string>

using namespace std;

struct ABCD {
string internalString;
};

int main() {

string foobar("foobar");

ABCD* abcd1 = new ABCD;
abcd1->internalString = foobar;

ABCD* abcd2 = abcd1;

delete abcd1;
delete abcd2;

string barfoo("foo");

cout<<foobar;
}


On the machine that I ran it - the output was "foo" :)

The Problem

One of the applications I was working with has a singleton class which stores some reference data for global access. The singleton contains a map whose key is a string which represents some code and the value is information about the code. The problem was that after processing some data, some of the codes used to go missing from the map. There were some fascinating things to note about the problem :
  • The map had the code as the key of the map. This key was stored by value.
  • The entry for the code that went missing was not removed from the map.
  • The value of the key where the code earlier used to reside, changed to a valid, beautiful looking ( well, it didn’t look beautiful when I was debugging the problem :) ) string. No corruptions. No dingbats.
As we all know, most stl implementations of map are a balanced binary tree implemented using red black trees. Which means searching the map by the key will involve comparing the value to be searched with other members of the map. The other members decide whether to go right, left or wherever to continue searching in the red black tree. Now, if the value to be searched for happened to be anywhere near the sphere of influence of the which went missing earlier, then the probability that this value to be searched will be reported as missing goes up ten folds. And it did happen. So, we had a beautiful problem to solve ( it’s easier to
say this after it’s solved :) )

Through the Looking-Glass, and What a C++ Developer Found There

I don’t care whether they call Richard Stallman an idiot, a fanatic , non-pragmatic buffoon or whatever. First he created emacs to save us from $sucky editors. and later he created gdb to make Bjarne Stroustrup’s creation more tolerable. And as HP took the liberty to add run time checking (RTC) to help solve all the chaos inflicted by C++ - there still was some hope ( The problem was reported on HP Itanium )

I ran the program from within gdb with honest intentions to find memory corruptions. And I found a double delete ! A confused C++ developer commented out all related portions of the code which was deleting stuff to check what happens - and all was well in the world again. What the .. ? The only plausible explanation to the case of the missing code - which was stored by value - seemed to be related to reference counting.
Thus the sample program was born.

The Annotated C++ Program

STL strings are inherently reference counted. Every time it’s copied - the reference count is incremented. Every time it is destroyed - the reference count is decremented. Once the reference count goes down to 0, the memory is freed.


#include <iostream>
#include <string>

using namespace std;

struct ABCD {
string internalString;
};

int main() {

string foobar("foobar"); // reference count is 1

ABCD* abcd1 = new ABCD;
abcd1->internalString = foobar; // reference count is incremented to 2

ABCD* abcd2 = abcd1; // no change in reference count here

delete abcd1; // reference count decremented to 1
delete abcd2; // So does it crash ? Does it crash ?
// No !!!! reference count decremented to 0 and
// memory released to the wild to be reallocated

string barfoo("foo"); // aha, reallocation algorithm finds a small enough
// place for this in the same location where foobar
// had its string

cout<<foobar; // so, now can you guess what this will print ?
}



All those who have perfect 20:20 vision, will ask whether we have a double free in hand when the string foobar goes out of scope. The answer is a big YES.

This can be explained by the following two snippets of code. The code is specific to Roguewave STL - the implementation of STL used in HP Itanium in my office.


#define private public // to access all the private stuff
// of std::string
#include <iostream>
#include <string>

using namespace std;

struct ABCD {
string internalString;
};

int main() {

string foobar("foobar");

ABCD* abcd1 = new ABCD;
abcd1->internalString = foobar;

ABCD* abcd2 = abcd1;

delete abcd1;

// access the internal reference count
cout<<"abcd2's reference count is :"
<<abcd2->internalString._C_pref()->__references()
<<endl;

delete abcd2;

cout<<"foobar's reference count is :"
<<foobar._C_pref()->__references()
<<endl;
// if foobar had gone out of scope here ,
// we'd be having a double delete condition.

string barfoo("foo");

// As both foobar and barfoo will be pointing to the same thing
// and as barfoo "captured" foobar's memory
// both of the below reference counts will be 1
cout<<"barfoo's reference count is :"
<<barfoo._C_pref()->__references()
<<endl;
cout<<"foobar's reference count is :"
<<foobar._C_pref()->__references()
<<endl;

cout<<foobar;

// When either of foobar goes out of scope
// it will cause a double delete.
}



The output is :

abcd2's reference count is : 1
foobar's reference count is : 0
barfoo's reference count is : 1
foobar's reference count is : 1
foo


Now, on destruction, the destructor of the basic_string template needs to decide whether to deallocate the memory or not. It decides this by evaluating the following code. Again, this is specific to Roguewave STL


if( _C_pref()->__references() == 0 ||
_C_pref()->__removeReference() == 0)


The second part of the or clause decrements the reference count and checks f it has reached 0 - this happens when we deleted abcd2. The first part happens when foobar is auto destroyed.

One size doesn’t fit all

When all explanations fail, only one prevails - Implementation Specific :). As with so many things which involve STL and C++, even the above problem is platform/implementation specific.
  • When the same program runs on Linux it segfaults on the line delete abcd2.
  • The program runs smoothly on AIX. The stl code was too cryptic for me to understand. It seemed that the problem should have occured in AIX too from what I understood from the stl code, but it didn’t happen. There can be two explanations to this. 1. Either I didn’t understand the code, which I’m almost sure of. or 2. The memory allocated for barfoo didn’t occupy the same as the one for foobar. I even tried to allocate 10000 strings just to check, but the problem failed to appear
  • The scenario was reproduced in Solaris as the STL implementation used in Solaris is Roguewave STL too (in the machines I tried on). The same behaviour was not observed when I changed the implementation to stlport4 though. Yet to check why.
  • When the program is compiled with -DRWSTD_NO_STRING_REF_COUNT i.e reference counting for strings is disabled in Itanium, everything works like a charm, as expected.
Memory related problems never fail to create enough drama to prevent documenting it. Every line in the program explained was atleast 10 call stacks away from each other in the actual code in which the problem happened. Once upon a time with C, only human errors seemed to cause tricky problems to solve. With C++, and more so with STL, the extra level of abstractions and automatic memory management baggage have added the ++ to the trickiness too, it seems.

Friday, March 27, 2009

Ditched Wishes

The only difference between sticking a steel/aluminium dish from my kitchen, out of the window, instead of the Dish TV antenna, is that I can atleast trouble someone in my opposite building with the reflection from the sun with the kitchen dish. Both, anyway, can't seem to brighten the CRT of my idiot box.

Such pathetic customer service. No SRK, I'm not santusht.

Wednesday, March 11, 2009

Ah, Virtual Memory ...

I seem to be on some sort of a hot treasure trail finding gold at the speed of thought. I found this no-words-to-praise explanation of how the kernel handles virtual memory/paging and all that niceties , written by Jeff Berryman, University of British Columbia, more than 3 decades ago, and lying in many places on the interweb now.
THE PAGING GAME

Rules

1. Each player gets several million things.

2. Things are kept in crates that hold 2048 things each. Things in
the same crate are called crate-mates.

3. Crates are stored either in the workshop or warehouse. The workshop is
almost always too small to hold all the crates.

4. There is only one workshop but there may be several warehouses.
Everybody shares them.

5. Each thing has its own thing number.

6. What you do with a thing is to zark it. Everybody takes turns zarking.

7. You can only zark your things, not anybody else's.

8. Things can only be zarked when they are in the workshop.

9. Only the Thing King knows whether a thing is in the workshop or in a
warehouse.

10. The longer a thing goes without being zarked, the grubbier it
is said to become.

11. The way you get things is to ask the Thing King. He only gives out
things in multiples of eight. This is to keep the royal overhead down.

12. The way you zark a thing is to give it thing number. If you give the
number of a thing that happens to be in a workshop it gets zarked right
away. If it is in a warehouse, the Thing King packs the crate containing
your thing back into the workshop. If there is no room in the workshop, he
first finds the grubbiest crate in the workshop, whether it be yours or
somebody else's, and packs it off with all its crate-mates to a warehouse.
In its place he puts the crate containing your thing. Your thing then gets
zarked and you never knew that it wasn't in the workshop all along.

13. Each player's stock of things have the same numbers as everybody else's.
The Thing King always knows who owns what thing and whose turn it is, so you
can't ever accidentally zark somebody else's thing even if it has the same
number as one of yours. (VS/2)

Notes

1. Traditionally, the Thing King sits at a large, segmented table and is
attended to by pages (the so-called "table pages") whose job it is to help
the king remember where all the things are and who they belong to.

2. One consequence of Rule 13 is that everybody's thing numbers will be
similar from game to game, regardless of the number of players.

3. The Thing King has a few things of his own, some of which move back and
forth between workshop and warehouse just like anybody else's, but some of
which are just too heavy to move out of the workshop.

4. With the given set of rules, oft-zarked things tend to get kept
mostly in the workshop while little-zarked things stay mostly in a warehouse.
This is efficient stock control.

5. Sometimes even the warehouses get full. The Thing King then has to start
piling things on the dump out back. This makes the game slower because it
takes a long time to get things off the dump when they are needed in the
workshop. A forthcoming change in the rules will allow the Thing King to
select the grubbiest things in the warehouses and send them to the dump in
his spare time, thus keeping the warehouses from getting too full. This
means that the most infrequently-zarked things will end up in the dump so
the Thing King won't have to get things from the dump so often. This should
speed up the game when there are a lot of players and the warehouses are
getting full. (Not applicable to VS/1)

LONG LIVE THE THING KING

De-bug-ging

Seemingly, with nothing original that I can think of myself, I am finding lots of things I find here and there pretty interesting. Especially this one on some mailing list that I found via reddit, which describes one of the best ways of debugging ;)

We called it the Rubber Duck method of debugging. It goes like this:

1) Beg, borrow, steal, buy, fabricate or otherwise obtain a rubber duck
(bathtub variety)
2) Place rubber duck on desk and inform it you are just going to go over
some code with it, if that's all right.
3) Explain to the duck what you code is supposed to do, and then go into
detail and explain things line by line
4) At some point you will tell the duck what you are doing next and then
realise that that is not in fact what you are actually doing. The duck
will sit there serenely, happy in the knowledge that it has helped you
on your way.

Works every time.

To think of it, I missed an opportunity to use my cat more effectively ;)

Sunday, March 08, 2009

Sharia, anyone ?

Vir Sanghvi had writen this nice article (as usual) in the Hindustan Times editorial page, regarding the sorry state of Pakistan. There was this small part about the imposition of the Sharia Law in the infamous Swat valley, and Imran Khan's comments on it.

Imran Khan (Keble College, Oxford, 1973-76) even declared that sharia law would be better because justice would be dispensed more swiftly! (I know this is politically incorrect but the Loin of the Punjab’s defence of sharia law reminded me of the famous Private Eye cover when his marriage to Jemima Goldsmith was announced. The Eye carried a picture of Khan speaking to Jemima’s father. “Can I have your daughter’s hand?” Imran was supposedly asking James Goldsmith. “Why? Has she been caught shoplifting?” Goldsmith replied. So much for sharia law.)

Ah, Sharia, how funny you are.

Friday, October 24, 2008

The Renaissance Man

I have often regretted the way I went around doing things in college. Or perhaps I didn't. My main problem was that my mind used to be in a very unsteady state, sometimes agreeing and sometimes disagreeing with itself. The fact that the only one I argued about these things was me myself, was another great source of confusion.

At one end I believed that I didn't get enough from the other side, and at the other end I felt that I wasn't doing much from my side. And on some other end I felt I was in a wrong place with the wrong people who neither had the concept of 'ends' nor starts. There never were any in-between thoughts, only extra dimensions, and all thoughts lurked at obscure corners of these dimensions.

And when I was about to give up, we got a charismatic leader whom I started adoring, and then eventually started loathing for playing double games. But I guess the charisma part did prove one thing; however disoriented and random the followers are, there's always a line of thought some leader can infuse which can give directions. It doesn't really matter whether the system supports it as much.

I guess that's what Barkha Dutt was saying about the renaissance man.