Cell data structures are huge and kill cache locality

ElkPal commented 7 years ago

TL;DR: Start by noticing how TextCell was almost 1kb in size - now 1/3rd of it - and is used to store single characters. And then realize that the entire cell system is like that: multiple huge data structures, each often allocating on its own (large wxStrings, etc).

The issue was originally about a symptom, now it indicates the root cause. This should be the top meta-issue for making cell data structures smaller and more cache-friendly, and about other impacts on the concise benchmark

Progress

[ ] #1445 Remove m_nextToDraw - 8 byte reduction in most derived cell types
[ ] #1335 Factor out CellList - 36 byte reduction, from 112 to 76 bytes
[ ] #1400 Factor out CellContext - 8 byte reduction, from 76 to 68 bytes, plus 24 byte reduction in TextCell, and a huge reduction in the allocations made by the text size cache there - in common cases it brings the extra cache allocations to zero, since the cache will be shared
[ ] Replace the virtual table pointer by an index into a type table - 8 byte reduction, to 60 bytes
[ ] Make Observer an empty base, used only when particular debugging modes are enabled; the CellPointers tracking should be performed without use of CellPtr. Specifically, bool Cell::m_referredByContext : 1 would make the cell "dirty" and cause an assertion in the destructor. It would be the CellContext's job to clear this flag and clear any CellPointers pointing to the cell before the cell gets destroyed. - 8 byte reduction, to 52 bytes (!).
[x] #1334 Small TextCells
[x] Reduce internal cells - allocate them on demand.
[x] Reduce the overhead of decoding maxima network output (there's a large bottleneck in present code due to the use of GetChar() in socket receive)

Background

The must-know background for this issue is provided in the following resources:

Avenues of Pursuit

There are multiple directions of attack for this issue.

Making incremental reductions in Cell type sizes - usually by judicious changes in member types, or by moving members out of Cell to derived classes, since Cell itself forces its size on all the others. 3af402dc is an example of such change.
Moving data that doesn't really belong in the Cell out of it. For example, most of the ..._old types are related to the state of the view of the cells. This is a whole different direction of attack. af3145fe is an example of such change.
Reducing the number of sub-allocations in Cells. This may require changes to data types/structures etc.
Contiguous cell storage: think of a "tape" memory arena, not of individually allocated list nodes.
... there's lots more - it is truly death by a million little cuts. There's no single piece of code that is responsible: the entire design of the Cell data type, and how it's used, is the culprit.

It should be noted that almost none of these improvements change algorithms to decrease computational complexity. This is an extremely important point. Algorithmic improvements revolve around changing higher-order polynomial time operations to linear or O(1) - but those usually stand out in profiles like sore thumbs, and involve localized changes in code. Instead, the data structure changes will touch lots of code at once (although mostly in a mechanical way), and there'll be no single contributor in profiles, but many - each function using a given data structure will have a slightly lower cost with every improvement.

Finally, let's quote Chandler Carruth: "You don't care about performance you don't measure". In other words: if we don't measure it, we'll constantly break it, so we can't claim at once that we care about performance and not do measurements to ensure that what we do actually improves anything. Benchmarking performance and resource use will have to become a part of unit tests, and significant performance regressions shall be treated as test failures.

This will be a long process, but I'm reopening this issue as @ElkPal has, indirectly, pointed to this larger issue.

I want to evaluate the command line

ratsimp((x-a)*(x-b)*(x-c)*(x-d)*(x-e)*(x-f)*(x-g)*(x-h)*(x-i)*(x-j)*(x-k)*(x-l)*(x-m)*(x-n)*(x-o)*(x-p)*(x-q)*(x-r)*(x-s)*(x-tt));

But after a long time(about 90 minutes) wxMaxima breaks displaying a message window, telling me

    An unhandled exception occured. Press 'Abort' to terminate the program.
    'Retry' to exit the program normally and 'Ignore' to try to continue

For me it seems, that wxMaxima executes commands like above within 3 steps

first: maxima is calculating (took about 5 minutes)
second: reading maxima output (took about 90 minutes)

Then wxMaxima broke evaluating. Normally a third step will follow when using a "shorter" command(maximal ...(x-s) )

-third step: parsing ouput

For me it seems, that wxMaxima has a kind of "array-overflow" or something like that. I tried the Maxima command line: it worked by principle but was unusable because of monitor overflow. I also tried xMaxima: it worked up to parenthesis ...(x-uu) , but when going on with (x-vv) I got also an overflow message. Note: I want to copy & paste the result of wxMaxima to the raytracing package POVRay. And I need the output format of wxMaxima, cause only then POVRay recognizes < * > - symbols as the multiplying operator. All calculations are done one a notebook with a quad core i7 4810 MQ(Haswell) with 32 GB RAM; 250 GB SSD & 500 GB HD

wxMaxima version: 16.4.2 Maxima version: 5.38.1_5_gdf93b7b_dirty Maxima build date: 2016-05-13 22:33:49 Host type: i686-w64-mingw32 System type: gcc -mno-cygwin -g -O2 -W -Wswitch -Wcomment -Wpointer-arith -Wimplicit -Wreturn-type -Wmissing-declarations -Wno-sign-compare -Wno-format-nonliteral -O2 -fexpensive-optimizations -falign-functions=4 -D_WIN32 -DENABLE_UNICODE -I/usr/local/include -DDYNAMIC_FFI -I. -L/usr/local/lib -lintl /usr/local/lib/libreadline.dll.a -L/usr/local/lib -ltermcap /usr/local/lib/libavcall.a /usr/local/lib/libcallback.a -luser32 -lws2_32 -lole32 -loleaut32 -luuid -liconv -L/usr/local/lib -lsigsegv libgnu_cl.a SAFETY=0 HEAPCODES STANDARD_HEAPCODES GENERATIONAL_GC SPVW_BLOCKS SPVW_MIXED TRIVIALMAP_MEMORY libsigsegv 2.8 libiconv 1.13 libreadline 6.0 GNU C 3.4.5 (mingw-vista special r3) PC/386 Lisp implementation type: CLISP Lisp implementation version: 2.49 (2010-07-07) (built on toshiba [192.168.43.3])

gunterkoenigsmann commented 7 years ago

It seems like github has mistaken parts of the equation as formatting information. I've taken the liberty to correct this. Your observation of the 3 steps is correct. ...let's see if the problem is reproducible on my computer: With any luck at all the nightly build that can be found at https://launchpad.net/~peterpall/+archive/ubuntu/wxmaxima-nightlies does no more have this problem. `

gunterkoenigsmann commented 7 years ago

I have less memory installed. Maybe this is the problem. ...but on my system it is maxima that runs out of memory.

One thing you could try in order to get the result in a format povray might understand is to start maxima and to enter:

 display2d:false;

Maybe this makes it output the data in the right format.

ElkPal commented 7 years ago

Hello Gunter, It workes,....by principle.

I entered in wxMaxima < display2d:false > and the evaluation with the last term (x-tt) was finished after 3 hours 45 minutes. I "dragged & dropped" some lines to POVRay which accepted this as practical.

But, the output format of the evaluation by itself is, let's say, of lowly quality: only about one-third of the monitor width is used. Very often only one operation is written in one line, e.g. < *h > or < + f >. It would be a great thing, if I could "increase" the width of the output format to that size of display2d:true. I do not require the < display2d:true > format by itself, cause it is a little bit sensitive when dragging & dropping. I'd like to have used the complete monitor width. The evaluation seized about 180 thousand lines; I expected about 60 thousand.

I did not look into the .../ubuntu/... Website, cause I'm using Win7 Prof 64 bit.

best regards

gunterkoenigsmann commented 7 years ago

Arrgh... ...for windows the nightly build is here: http://andrejv.github.io/wxmaxima/develop.html#WindowsNightly ...but it seems to be too slow for this problem to be of practical use, at least on my computer.

What about the following?

 display2d:false;
 with_stdout ("data.txt", ratsimp((x-a)*(x-b)*(x-c)*(x-d)*(x-e)*(x-f)*(x-g)*(x-h)*(x-i)*(x-j)*(x-k)*(x-l)*(x-m)*(x-n)*(x-o)*(x-p)*(x-q)*(x-r)*(x-s)*(x-tt)));

Kind regards, Gunter.

gunterkoenigsmann commented 7 years ago

If that doesn't help I'm rather out of options. But perhaps one of the excellent mathematicians of the maxima team at https://maxima.sourceforge.net knows an answer...

ElkPal commented 7 years ago

Hello Gunter,

I tried your suggestion but had no success. The output format didn't change to a better utilization of the monitor width; so I stopped the evaluation run by task manager. Well, and with the nightly build......I'm not a software developer and I see me unable to create a wxmaxima.exe using such a nightly build. I will continue working using the command line < display2d:false > . Maybe in the next official release the issue has gone. Additionally I will create an ubuntu VMware session and look how wxmaxima works there

best regards

gunterkoenigsmann commented 7 years ago

You can get a .zip containing the current wxmaxima.exe and all .dll files needed to make them run from Wolfgang's homepage, see: http://andrejv.github.io/wxmaxima/develop.html#WindowsNightly

One additional question, just for interest: If ratsimp doesn't actually seem to make your equation simpler... ...what is the use of using it for this problem?

gunterkoenigsmann commented 7 years ago

Ah... ...and if you dare to try replacing your wxMaxima.exe and the two .dll files (without them it will crash) with a nightly build and you want to use printing: Better wait for tomorrow. Have just fixed 2 bad bugs I introduced there...

ElkPal commented 7 years ago

Hello Gunter, there are several things to report. Since my last entry in the "issue list" I tried to run wxMaxima in a linux distribution. So I created several VWware sessions but failed using ubuntu, open suse and knoppix. I must admit, that I did't look in detail why I failed. But I had at least a little success using debian linux 8.5.0 64 bit which came with wxMaxima 13.04.2 (wxwidgets 3.0.2; Maxima 5.34.1) Here I could evaluate the until term ....*(x-tt) and the output seized nearly the complete monitor-width; it was absolutely OK. Then I chose ratsimp with (x-uu) as the last term, started the evaluation in the late evening and next morning I found out that wxMaxima has closed itself like using the -Exit- command without presenting any message. What a pity! Cause saving the result in the UTF-8 format is much better than in wxmx format.

Your suggestion using the nightly build of wxMaxima.exe(back again using win10 on a desktop-PC) together with the two .dll's: I tried

with (x-tt)*(x-uu) as the last terms. But after a very long time of calculating and obviously at the end of step 2 wxMaxima broke sending an error window Wxmaxim ***Caught unhandled unknown exception; terminating I pressed the OK button and got the message (german language presetted) I do not believe that it is a problem with maxima or even with lisp, cause I got a result with ...(x-uu) using the command line Coming to your question: what do I want to do? I want to describe a function like the gaussian ditributian curve or a cam disc via a polynomial using the newton interpolation formula. I need the polynomial representation; I cannot use the iterative calculation of individual points. And I found out, that the order of the polynomial has to be at least the value of 20. best regards

gunterkoenigsmann commented 7 years ago

German is fine for me, as is italian, french, spanisch and portugese (if I don't have ro write in any of these languages, that is ;-) )

Sounds like the nasty out-of-memory behavior operating systems that allow for memory commitment show: On old systems a malloc() returned a pointer to the memory that was allocated - or a NULL pointer if there was not enough memory left.

On modern systems malloc() can also return a pointer to a virtual address when the system hopes that some other process doesn't use all the memory it has allocated and therefore there might be memory available - that is if this malloc'ed memory chunk is actually used.

This means: malloc() returns a valid pointer, but if you try to actually write to the piece of memory you got it is too late to inform the application that there is no memory and the application (or any other application your operating system feels to be less important) will get killed. Sometimes in this situation there is no memory left to send a network packet to wxMaxima that maxima has died...

...what my question intended to do was: would horner(), expand() or something similar work for you instead of ratsimp() and would the result perhaps be better suited for your task? Since the original equation from your bug report looked quite handy and it was already symbolic I was also wondering if it would handled in a better way with the functions that work on rational forms of polynoms. Don't know if the guys from the maxima-discuss list on maxima.sourceforge.net would have better ideas as most of them are real mathematicians...

Kind regards,

 Gunter.

gunterkoenigsmann commented 7 years ago

I have taken the liberty to forward this problem to the maxima-discuss mailing list. The answer from Richard Fateman was:

This is an expansion of ratsimp((x-a)(x-b)(x-c)(x-d)(x-e)(x-f)(x-g)(x-h)(x-i)(x-j)(x-k)(x-l)(x-m)(x-n)(x-o)(x-p)(x-q)(x-r)(x-s)*(x-tt));

which takes a long time, and space, and maybe runs out of space and maybe can't be displayed.

how large is this expression? we could figure this out mathematically, or we could hack a program to compute it. h(n):=length(ratexpand(product(x-a[i],i,1,n)));

we find h(2)=4, h(3)=8, h(4)=16. Now, guessing at the pattern here, we figure that h(20) is 2^20, or about 1 million. If we want to see the results of this expansion and pack 5 terms per 3-line section, to allow for superscripts, we have the result that we could print out this expression on paper, 80 lines per page, in 7,864 pages. or 15.7 reams of paper. This would weight 78.6 pounds or 35 kilograms.

Now if we just want to store it, consider that each coefficient is maybe 8 bytes, and along with the coefficient we have exponents for up to 20 different variables, so that's another 160 bytes, and then various overheads for storing lists and such, ... so we are talking about maybe 300 megabytes for the CRE form. See below for another estimate...

That's what you would be using if you did rat( ....).

But by requesting ratsimp(...) you required that this be computed, and then converted to general representation, which explicitly represents each + and *, not by position in a list of coefficients, etc.

computing rat(product(x-a[i],i,1,15)) without displaying it, takes 0.0320 sec and uses 5.05 megabytes

computing ratsimp(product(x-a[i],i,1,15)) without displaying it, takes 0.610 sec and uses 191 megabytes

in fact rat(...20) is quite doable. It takes 2.4 seconds and uses 160mb

on my desktop computer.

ratsimp(... 20 ) would, I predict, "use" 2.7 gigabytes. If my system didn't refuse to do it.

So using ratsimp instead of rat increases the time and space by nearly 20X. And makes some problems exceed storage.

I don't know how the results from showtime:all actually correspond to the sizes of the resulting expressions, but it is a simple way of looking at whether you are trying to do something that uses a great deal of memory.

Now for that to be run through wxmaxima, I suppose it needs to be re-represented as formatted stuff. I don't know if this is done all at once or in sections. But if it must be converted to xml or something like that, there's a lot a stuff there. maybe another factor of 5?

Moral of this story.

It is easy to come up with an example that is too large, practically, for a computer algebra system. That's part of the marvel of mathematics.

RJF

ElkPal commented 7 years ago

Hello Gunter,

I'm rather disappointed that the issue #804 has been closed. I see no satisfying statement. The calculations of Richard are unreproducible for me cause I don't understand the commands <length(ratexpand....> , <rat(product...> , <ratsimp(product...>. Anyway, the conclusions of Richard are wrong as respects the predicted file size. Please have a look at the files; open the .txt using notepad++. We are far far away from a 1 GB limit and nowadays file sizes up to one GB under 64 bit systems should be no problem. You see that with each term more the file size doubles; that means by taking the debian wxMaxima as a basis(where the UTF-8 format has an unbeatable ease of use) I could reach the exponent of 27 staying with a file size under 1 GB. And using the Win7 wxMaxima as a basis with .wxmx output format I would expect with last term (x-tt) [(ratsimp(...)]a file size of about 38.500 kB and n o t 2.7 GB.

best regards

LastTerm(x-tt)UTF-8Debian8_5_0wxMaxima13_04_2.txt LastTerm(x-s)UTTF-8Debian8_5_0wxMaxima13_04_2.txt wxMaxima.zip

ElkPal commented 7 years ago

Hello Gunter,

although the issue #804 has been closed, please have a look on my last comment

especially on the attached files and their sizes.

best regards

                      ElkPal

Von: Gunter Königsman [mailto:notifications@github.com] Gesendet: Donnerstag, 24. November 2016 08:26 An: andrejv/wxmaxima wxmaxima@noreply.github.com Cc: ElkPal hkoob@t-online.de; Author author@noreply.github.com Betreff: Re: [andrejv/wxmaxima] issue in (#804)

Closed #804 https://github.com/andrejv/wxmaxima/issues/804 .

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/andrejv/wxmaxima/issues/804#event-870704233 , or mute the thread https://github.com/notifications/unsubscribe-auth/AWQhX3DjURw8mmDIOX7QAq2iZTLMM8mfks5rBTv2gaJpZM4KsgMf .

gunterkoenigsmann commented 7 years ago

With the help of Richard's answer I am confident I can fill in the missing pieces what uses up the memory:

The final result is only a few 100 Megabytes.
But if in order to reach it a full list of all terms of the expanded equation is built - and this list needs way more memory than the final result that.
I don't know how complicated the stuff is that ratsimp() does do. My guess would be that it does do some advanced things that might need loads of temporary data for finding complicated transforms or in order to speed things up.
Another factor that reduces the memory efficiency might be that for most calculation steps two copies of the equation are needed: One that is read and the one that is currently written to.
For transferring the equation to wxMaxima it is converted into XML tags - that are kind of human-readable, but not extremely memory efficient.
In order to display the equation wxMaxima and in order to determine where to introduce line breaks and which fractions are short enough that they can be displayed as 2D math generates a separate object out of each element of the equation.
These objects not only contain the text. But in order to speed up the display algorithms they need to cache information like the pointer to the next and previous object, their height and their width. Each object also needs to know if it should return a fancy representation when converted to a string.
and then several versions of the result are placed on the clipboard: A text version, a MathML version, a RTF version...

...and the current windows installers contain only a 32-bit version of wxMaxima. All of this together might make the problem too big for today's computers, currently.

KubaO commented 4 years ago

Looking back, @ElkPal:

I don't understand the commands length(ratexpand....> , <rat(product...> , <ratsimp(product...>

These are all maxima commands and are documented reasonably. To get e.g. h(20), copy-paste the following to Maxima or wxMaxima:

h(n):=length(ratexpand(product(x-a[i],i,1,n)));
h(20);

That takes ~30 seconds on my laptop - it's time taken by Maxima, not wxMaxima.

And then: a .txt file is not something you can format for display as math. You need to decompose it, and that's where overheads come in.

Then let's try something smaller - let's actually display a small part of your problem:

hs(n):=ratsimp(ratexpand(product(x-a[i],i,1,n)));
hs(10);

There are almost 3 cells per term, on average. To display some 2^20 = 1M terms, or about 3M cells - would require well over 2.5G of memory from wxMaxima, and that's an underestimate. Each cell in wxMaxima is 250 -1000 bytes today, and work is being done on bringing this down, possibly by a lot. Back when this issue was filed it was worse. Additionally, there's lots of allocation churn in the parser - this could be improved, but it takes a decent chunk of time. The problem is (as usual) that wxXML makes string copies for data already present in the xml character stream - where instead of copies, string views would do instead.

In any case, the first order of business is to get ratsimp to produce such output under Maxima. @ElkPal: can you get hs(20) to finish in Maxima itself?

Then it'd be a nice benchmark for wxMaxima to be able not to choke up on such input. At the moment, VTune has a field day with the results. Basically every imaginable level is a problem in itself: I/O buffering, XML parsing, string allocations, cell allocation, etc. The best I can figure out is that the way cells are allocated and managed would need to be vastly different, to minimize per-cell overhead - by a lot, perhaps to 32 bytes for a TextCell with just one or two letters. That's a longer term project, that I would be executing along with a fundamental rework of the layout code. But this is a multi-year project in terms of calendar time (not man-hours, it's just that it's only a hobby for me).

The basic Cell idea is to have cells below a certain size be immutable and be allocated and constructed in a single operation, such that only one allocation is made. Such cells would be small enough that they would also contain their CellPtr Observer block. The memory for the cells would need to come from an arena ("tape") allocator, and instead of using huge pointers (8 bytes on 64-bit platforms), the cells would use indices to 16-byte (or similar) arena blocks. A 32-bit quantity would be able to index arena for documents with up to 1G cells. That's a reasonable upper bound, as such an arena would be 16GB or 32GB in size, depending on what granularity we chose. Something like CapnProto could be used as a drop-in data serializer/deserializer/arena manager with overheads low enough that it'd take care of the low-level, compact memory layout (perhaps with some domain-specific adaptations later on). I use CapnProto a lot so I know it'd be one of viable options.

I was thinking about how to lower the allocation overhead in cells and it's only due to @ElkPal 's issue that I realized that code exists that solves similar problems.

The kicker? Instead of formatting xml on Maxima side, we'd be sending a CapnProto stream directly - that can be produced with a much lower overhead, and the send-only variant of the code would be way, way simpler to implement in LISP than a generic solution. XML is just too much to deal with. We could do JSON if we wanted to use a fast JSON parser

KubaO commented 4 years ago

Baseline: at the moment, an expression (x) in Maxima output takes about 2.5kB of cell memory.

KubaO commented 3 years ago

I can do the following right now in release mode. It consumes about 770MB of RAM on a 64-bit system.

hs(n):=ratsimp(ratexpand(product(x-a[i],i,1,n)));
hs(17);

There's plenty of room for improvement, but it's better than what we had before. Some time is wasted in the stream receiver from the maxima process. 99% of recalc time is wasted waiting for RAM, because the cells are still relatively huge... But at least it doesn't hang. It just takes a bit.

With a bit more work, we should be able to at least have hs(17) display and recalculate quickly. Then maxima will be the slow point. But we have a handle on RAM, somewhat. It used to be 10x worse at least.

There are some regressions that my code introduced compared to "original" wxMaxima (pre-2020 code base), but those used to be buried under everything else. Now I see improvements galore :)

KubaO commented 3 years ago

The recent changes specifically target improvements visible when performing this benchmark. It turns out that it's an excellent one, because it tends to "produce" very general improvements that provide lots of immediate benefit.

The fix-maxima-receive-perf feature (being in final stages of tweaking) has just removed ~10s of character-by-character decoding when receiving the large reply from maxima - a major slow-down with this benchmark, but a good contributor to overall performance as well even if it's down low in the profiles - it's a case of a death by a thousand cuts, and we've got yet another cut less now :)

KubaO commented 3 years ago

Thanks to Wolfgang's parser improvements and my work, wxMaxima can now "fit" a problem of size 19 (see below). It uses about 3GB of RAM to fit all the cells, and 1GB more while the XML parser is running. The problem size 20 causes sbcl maxima to crash due to heap exhaustion on my machine. At least as far as wxMaxima is concerned, we're certainly heading in the right direction.

"Fitting" the problem doesn't mean that it's usable yet: the cell layout still takes too long - tens of minutes for problem size 19, 4 minutes for problem size 17.

But that's now an algorithmic issue just as much as memory use issue: we're doing way too much work in computing cell layout before screen redraw is possible, and the cells are big enough to slow this process down.

On Windows, the text sizing is quite fast, surprisingly enough, and actually going through all the cells and calculating their text extents is quite fast. But it may not always be the case - especially not on other platforms.

There are two kinds of strings we size: small strings, like short indices and variable names - these can be shared by many cells, and thus a common cache for their sizes would help. I'd expect larger strings to be mostly used by one cell only, so their size can be cached within the cell itself, like we do.

hs(n):=ratsimp(ratexpand(product(x-a[i],i,1,n)));
hs(19);

KubaO commented 3 years ago

The improvements being made while working on #1445 bring hs(17) time down from 4 minutes to ~40 seconds.

wxMaxima-developers / wxmaxima