xemantic / java-2-times-faster-than-c

An inquiry into nondogmatic software development. An experiment showing double the performance of code running on the JVM compared to equivalent native C code.
GNU General Public License v3.0
51 stars · 3 forks

malloc() #1

Open nomemory opened 3 years ago

nomemory commented 3 years ago

I think the problem lies in the fact that the C code is using malloc() (a lot), which is not exactly as "performant" as new ... in Java.

For this particular use case (your benchmark), you probably need to switch from malloc() to something else - something that's efficient at allocating small blocks of the same size (e.g. Nodes).

http://dmitrysoshnikov.com/compilers/writing-a-pool-allocator/

PS: Think of malloc() as something general-purpose, suitable for allocating everything from small to huge chunks of memory.

morisil commented 3 years ago

Yes, adding more memory management to the native solution will probably make it even more performant than Java. No doubt about it. But the whole point of this project is to falsify statements like "automatic memory management always slows applications down".

wbrickner commented 3 years ago

I'm not familiar with the internals of the JVM, but I would suspect that the new keyword in Java does not often become a real malloc call under the hood. I bet it preallocates a memory arena and then has a fast way of dealing that memory out, custom-fit to the typical lifecycle patterns of objects in Java.

Your C code has no such benefit. To match the performance of Java, try implementing a simple memory arena allocator.

skroll commented 3 years ago

Right - the JVM reserves part of the heap at startup, so each call to new just pulls from memory the process already owns. Using a pool allocator in the C code would actually be a more accurate comparison.
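To make the comparison concrete, here is a minimal bump-pointer arena in C - a sketch of the kind of "hand memory out from a preallocated region" scheme described above (the names `arena_alloc`/`arena_reset` and the fixed buffer size are my own, not from the project):

```c
#include <stddef.h>

/* Fixed backing buffer; a real arena would grow or chain blocks. */
static unsigned char arena[1 << 16];
static size_t arena_top = 0;

/* Bump-pointer allocation: just advance an offset, roughly what the
 * JVM does inside a thread-local allocation buffer. Returns NULL
 * when the arena is exhausted. */
void *arena_alloc(size_t size) {
  size = (size + 7) & ~(size_t)7;          /* keep 8-byte alignment */
  if (arena_top + size > sizeof arena) return NULL;
  void *p = arena + arena_top;
  arena_top += size;
  return p;
}

/* Reclamation is wholesale, like a GC freeing an entire region. */
void arena_reset(void) { arena_top = 0; }
```

Individual `free()` calls disappear entirely; the trade-off is that nothing can be reclaimed piecemeal.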

cfriedt commented 3 years ago

Yep. That's exactly the reason.

boomlinde commented 3 years ago

In this case a simple one-slot memory reuse cache will suffice, since the code spends the bulk of its time freeing memory only to immediately allocate the same amount for the same type of object again. Or you could skip dispose+newNode entirely and modify the node in place.

diff --git a/src/main/c/java_2_times_faster_than_c.c b/src/main/c/java_2_times_faster_than_c.c
index 5d78e7a..f05b59d 100644
--- a/src/main/c/java_2_times_faster_than_c.c
+++ b/src/main/c/java_2_times_faster_than_c.c
@@ -31,8 +31,14 @@ struct Node {
   long id;
 };

+Node *lastDisposed = NULL;
+
 struct Node *newNode(long id) {
-  Node* node = malloc(sizeof(Node));
+  Node *node = lastDisposed;
+  if (!node)
+    node = malloc(sizeof(Node));
+  else
+    lastDisposed = NULL;
   node->id = id;
   return node;
 }
@@ -47,7 +53,7 @@ void join(Node *root, Node *node) {
 void dispose(Node *node) { // delete is reserved in C
   node->next->previous = node->previous;
   node->previous->next = node->next;
-  free(node);
+  lastDisposed = node;
 }

 void insert(Node *previous, Node *node) {

This trivial optimization turns 1m7s runtime into 14s on my system and has the same memory use characteristics as the original.

With this in mind, we can easily avoid dynamic allocation altogether, e.g.

diff --git a/src/main/c/java_2_times_faster_than_c.c b/src/main/c/java_2_times_faster_than_c.c
index 5d78e7a..bf053b0 100644
--- a/src/main/c/java_2_times_faster_than_c.c
+++ b/src/main/c/java_2_times_faster_than_c.c
@@ -20,7 +20,7 @@
 #include <stdio.h>
 #include <stdlib.h>

-const int NODE_COUNT = 1000;
+#define NODE_COUNT 1000
 const long TRAVERSAL_COUNT = 5000000000L;

 typedef struct Node Node;
@@ -31,8 +31,17 @@ struct Node {
   long id;
 };

+Node *lastDisposed = NULL;
+
+Node mem[NODE_COUNT];
+size_t nodei = 0;
+
 struct Node *newNode(long id) {
-  Node* node = malloc(sizeof(Node));
+  Node *node = lastDisposed;
+  if (!node)
+    node = &mem[nodei++];
+  else
+    lastDisposed = NULL;
   node->id = id;
   return node;
 }
@@ -47,7 +56,7 @@ void join(Node *root, Node *node) {
 void dispose(Node *node) { // delete is reserved in C
   node->next->previous = node->previous;
   node->previous->next = node->next;
-  free(node);
+  lastDisposed = node;
 }

But the runtime improvement at that point is only marginal on my system, since there are only a thousand allocations altogether.

Yes, all of this will be done for you behind the scenes by the Java memory manager. In C you have to control memory yourself; that's its advantage and its disadvantage compared to Java. I'd argue that a performance-minded C programmer would quickly recognize the free-shortly-followed-by-malloc pattern and implement a faster solution in the first place.

Personally, I would not have any of the node functions manage memory. newNode would become initNode and dispose would become unlink. Then you'd initNode your recently unlinked node in place of freeing and allocating.

boomlinde commented 3 years ago

Here's how that looks btw:

diff --git a/src/main/c/java_2_times_faster_than_c.c b/src/main/c/java_2_times_faster_than_c.c
index 5d78e7a..8e6678c 100644
--- a/src/main/c/java_2_times_faster_than_c.c
+++ b/src/main/c/java_2_times_faster_than_c.c
@@ -20,8 +20,8 @@
 #include <stdio.h>
 #include <stdlib.h>

-const int NODE_COUNT = 1000;
-const long TRAVERSAL_COUNT = 5000000000L;
+#define NODE_COUNT 1000
+#define TRAVERSAL_COUNT 5000000000L

 typedef struct Node Node;

@@ -31,8 +31,7 @@ struct Node {
   long id;
 };

-struct Node *newNode(long id) {
-  Node* node = malloc(sizeof(Node));
+struct Node *initNode(Node *node, long id) {
   node->id = id;
   return node;
 }
@@ -44,10 +43,9 @@ void join(Node *root, Node *node) {
   node->next = root;
 }

-void dispose(Node *node) { // delete is reserved in C
+void unlink(Node *node) {
   node->next->previous = node->previous;
   node->previous->next = node->next;
-  free(node);
 }

 void insert(Node *previous, Node *node) {
@@ -58,11 +56,13 @@ void insert(Node *previous, Node *node) {
 }

 int main() {
+  static Node mem[NODE_COUNT];
+  static size_t nodei = 0;
   long nodeId = 0;
-  Node *head = newNode(nodeId++);
-  join(head, newNode(nodeId++));
+  Node *head = initNode(&mem[nodei++], nodeId++);
+  join(head, initNode(&mem[nodei++], nodeId++));
   for (int i = 2; i < NODE_COUNT; i++) {
-    insert(head, newNode(nodeId++));
+    insert(head, initNode(&mem[nodei++], nodeId++));
   }
   Node *toDelete = head;
   Node *toInsert = head;
@@ -72,9 +72,9 @@ int main() {
     if (toInsert == toDelete) {
       toInsert = toInsert->next;
     }
-    dispose(toDelete);
+    unlink(toDelete);
+    insert(toInsert, initNode(toDelete, nodeId++));
     toDelete = prevToDelete;
-    insert(toInsert, newNode(nodeId++));
   }
   long checksum = 0;
  head = toInsert;

morisil commented 3 years ago

@boomlinde thank you for the amazing comments and feedback. I updated the README with an explanation of why I am not convinced to change anything in the example C code. I see that I made a mistake by skipping something I had in the code initially. My intent was to make nodes of variable size, with unpredictable numbers of them, but I skipped that for the sake of simplicity - and it was a mistake, because the simplicity obscured the conceptual value of what I wanted to demonstrate. But thank you again for the time and effort spent analyzing this with me.

winrid commented 3 years ago

@morisil The problem is that if we're comparing real life C to Java, this isn't how you'd write the C code. You would use an arena/pool allocator.

That's why it seems odd, because we're not comparing real life examples.

Maybe it would be worth having more than one version of the C/Java code. You could use an allocator on top of the heap in Java as well, to compare two "optimized" versions.

Remember, you're already doing memory management in the C version - just inefficiently (no offense).

Anyway this is a neat benchmark, thanks for making it.

morisil commented 3 years ago

@winrid "You would use an arena/pool allocator." - would that also apply if the allocated nodes are of unpredictable size? This is the typical use case I had in mind, because it reflects the specifics of the server-side code I usually deal with. It was a mistake not to include it in the example, because now it obscures my own intentions. :( Imagine a cache server being constantly asked for data over HTTP while at the same time constantly updating invalidated items in the cache.

BTW maybe you can tell me what the best functional equivalent of Java's byte[] data = new byte[654] would be in C, so I can still quickly do data.length?

Regarding multiple versions - definitely, if you want to contribute one, I would be very happy to include it in the project. But please take the unpredictable size of data structures into consideration.

winrid commented 3 years ago

@morisil One solution is that when you ask the allocator for an amount of memory and there isn't a block that can fit it, you combine empty blocks or request more from the OS.

Regarding byte[size] - why not create a wrapper structure that stores the size?
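Such a wrapper could look like the following sketch - a length-prefixed byte buffer using a C99 flexible array member (the `ByteArray` type and `byte_array_new` name are hypothetical, not from the project):

```c
#include <stdlib.h>
#include <string.h>

/* Closest C analogue to Java's byte[], which carries its own .length. */
typedef struct {
  size_t length;
  unsigned char data[];   /* C99 flexible array member */
} ByteArray;

/* One allocation holds both the header and the payload. */
ByteArray *byte_array_new(size_t length) {
  ByteArray *a = malloc(sizeof(ByteArray) + length);
  if (a) {
    a->length = length;
    memset(a->data, 0, length);  /* Java zero-initializes arrays */
  }
  return a;
}
```

Then `ByteArray *data = byte_array_new(654);` gives you `data->length` in place of Java's `data.length`, and a single `free(data)` releases it.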

I'll see about adding the two versions.

cfriedt commented 3 years ago

On the topic of a pool allocator in C, one approach here is simply not to use malloc() but to statically allocate an array sized to the maximum number of nodes you foresee requiring. Then you could track which statically allocated nodes are in use in a number of ways - e.g. a bit in the node structure itself, or a separate array of uint64_t where each bit represents the availability of a node. In the latter case, a simple and effective way to find an unallocated node is __builtin_ffs().

If that approach were taken, the C implementation would likely outperform the Java version drastically (just a guess).
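A minimal sketch of that bitmap scheme, shrunk to a single 64-bit word for brevity (the names `pool_alloc`/`pool_free` are my own; `__builtin_ffsll` is the GCC/Clang find-first-set builtin mentioned above):

```c
#include <stdint.h>
#include <stddef.h>

#define POOL_SIZE 64   /* one uint64_t bitmap word; scale with an array */

typedef struct Node { struct Node *previous, *next; long id; } Node;

static Node pool[POOL_SIZE];
static uint64_t free_bits = UINT64_MAX;   /* bit set = slot free */

/* Find the first free slot via find-first-set; NULL when exhausted. */
Node *pool_alloc(void) {
  int i = __builtin_ffsll((long long)free_bits);  /* 1-based, 0 if none */
  if (i == 0) return NULL;
  free_bits &= ~(UINT64_C(1) << (i - 1));
  return &pool[i - 1];
}

/* Freeing is just setting the slot's bit again - no free() call. */
void pool_free(Node *n) {
  free_bits |= UINT64_C(1) << (n - pool);
}
```

Allocation and deallocation are each a couple of bit operations, which is the property that makes this competitive with the JVM's allocator for fixed-size objects.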

morisil commented 3 years ago

BTW I wanted to improve the clarity of my intentions behind this algorithm by allocating data structures of unpredictable size, as would rather happen in real life. For this reason I quickly tested a deterministic "almost random" distribution I often use in GLSL (borrowed from The Book of Shaders). I also made an isolated benchmark just for this function, expecting the C version to clearly outperform the Java one (latest commit). But the opposite seems to be true: Java is slightly faster. I would be happy to find an explanation for this one. I suspect HotSpot is doing some aggressive inlining, possible once the running code has been analyzed for a while. Another possibility would be that the functions called in the C math library are actually not inlined, which is hard for me to believe. If there is no reasonable explanation, I will disassemble the C code and try to dump the actual assembly of the compiled machine code from HotSpot, which I know is technically possible, but I haven't tried so far:

https://wiki.openjdk.java.net/display/HotSpot/PrintAssembly

morisil commented 3 years ago

I updated the project with Java and C versions operating on allocations of unpredictable size, which gave an even bigger performance boost to the Java version. I would be happy to create a flavor of the C code with proper memory management, but I don't feel competent enough to do it in a reasonable time.

skroll commented 3 years ago

A while ago I wrote a tiny library for this very sort of thing. I'll try porting the code to see how it works: https://github.com/skroll/libpsca It doesn't have to be a library; it's something that can just be dropped in.

hakanai commented 1 year ago

> On the topic of a pool allocator, in C, one approach here is to simply not use malloc() but statically allocate an array sized to the maximum number of nodes that you foresee yourself requiring.

I know from writing game engines in Java that the same sort of object reuse can also benefit Java. So of course, if someone were to make a new demo that reuses objects in the C version, I'd trust they would also provide an equivalent update for the Java version, so that it remains an apples-to-apples comparison.

Updating only the C version and not the Java version would be something you'd do if you just wanted to mislead the audience. :)