zbateson / mail-mime-parser

An email parser written in PHP
https://mail-mime-parser.org/
BSD 2-Clause "Simplified" License

Memory Leak #193

Closed ellisonpatterson closed 2 years ago

ellisonpatterson commented 2 years ago

I have a long-running script and I noticed that memory usage rises slowly over time until it exceeds the maximum memory limit set for PHP.

I'm wondering if something isn't being removed when I unset the message.

You can see it happening live by running this test script:

<?php

$email = '
From: "Doug Sauder" <doug@example.com>
To: "Jürgen Schmürgen" <schmuergen@example.com>
Subject: Die Hasen und
   die Frösche (Microsoft Outlook 00)
Date: Wed, 17 May 2000 19:08:29 -0400
Message-ID: <NDBBIAKOPKHFGPLCODIGIEKBCHAA.doug@example.com>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 8bit
X-Priority: 3 (Normal)
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2910.0)
Importance: Normal
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300

Die Hasen und die Frösche

Die Hasen klagten einst über ihre mißliche Lage; "wir leben", sprach ein
Redner, "in steter Furcht vor Menschen und Tieren, eine Beute der Hunde, der
Adler, ja fast aller Raubtiere! Unsere stete Angst ist ärger als der Tod
selbst. Auf, laßt uns ein für allemal sterben."

In einem nahen Teich wollten sie sich nun ersäufen; sie eilten ihm zu;
allein das außerordentliche Getöse und ihre wunderbare Gestalt erschreckte
eine Menge Frösche, die am Ufer saßen, so sehr, daß sie aufs schnellste
untertauchten.

"Halt", rief nun eben dieser Sprecher, "wir wollen das Ersäufen noch ein
wenig aufschieben, denn auch uns fürchten, wie ihr seht, einige Tiere,
welche also wohl noch unglücklicher sein müssen als wir."
';

echo "Memory usage before: " . memory_get_usage(true) . PHP_EOL;

$i = 0;
while($i < 5000) {
    $message = \ZBateson\MailMimeParser\Message::from($email, false);
    unset($message);

    $i++;
}

echo "Memory usage after: " . memory_get_usage(true) . PHP_EOL;

Output from script:

user@testing:~/cliapps$ php app.php 
Memory usage before: 6291456
Memory usage after: 20971520

zbateson commented 2 years ago

Hi @ellisonpatterson --

Yeah, that doesn't look good. I'm not sure off-hand what would be causing that -- there are singletons, for instance, but they should only be created once... there aren't many global variables, and object destruction/memory freeing should be handled by PHP as things go out of scope.

Probably the best way to get to the bottom of it would be to run it with Xdebug and trace it.

zbateson commented 2 years ago

Hi @ellisonpatterson --

Looking more closely at this, it does seem to me like memory is reclaimed okay. Note that 'memory_get_usage(true)' returns the total memory allocated from the system, including unused pages, rather than the memory currently in use by PHP. It's entirely likely PHP won't return memory to the system, but will reuse it internally.
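To illustrate the distinction, here is a minimal sketch that records both figures before, during, and after a large allocation. The `snapshot()` helper and the 8 MB string are illustrative assumptions, not part of the library, and the exact numbers depend on the PHP version and allocator.

```php
<?php
// Record both views of memory: "used" is what the script currently holds,
// "real" is what PHP's allocator has obtained from the system.
// snapshot() is a hypothetical helper for this illustration only.
function snapshot(): array
{
    return [
        'used' => memory_get_usage(false),
        'real' => memory_get_usage(true),
    ];
}

$before = snapshot();
$data = str_repeat('x', 8 * 1024 * 1024); // allocate roughly 8 MB
$during = snapshot();
unset($data);                             // release the string again
$after = snapshot();

$all = ['before' => $before, 'during' => $during, 'after' => $after];
foreach ($all as $label => $s) {
    printf("%-6s used=%d real=%d\n", $label, $s['used'], $s['real']);
}
```

The "used" figure should rise while `$data` exists and fall once it is unset; whether "real" falls as well is up to the allocator, which is why it is a poor indicator of a leak on its own.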

I've modified your script slightly (note the added calls to gc_collect_cycles() and sleep(5) after the loop) to get a better idea of what may be going on:

echo "Memory usage before: " . (memory_get_usage(false) / 1024 / 1024) . PHP_EOL;
echo "Memory peak usage before: " . (memory_get_peak_usage(false) / 1024 / 1024) . PHP_EOL;

$msg = \ZBateson\MailMimeParser\Message::from($email, false);
unset($msg);

echo "Memory usage after load: " . (memory_get_usage(false) / 1024 / 1024) . PHP_EOL;

$i = 0;
while($i < 5000) {
    $message = \ZBateson\MailMimeParser\Message::from($email, false);
    unset($message);
    $i++;
}
gc_collect_cycles();
sleep(5);

echo "Memory usage after: " . (memory_get_usage(false) / 1024 / 1024) . PHP_EOL;
echo "Memory peak usage after: " . (memory_get_peak_usage(false) / 1024 / 1024) . PHP_EOL;

What I've noticed is that, whether looping 5000 times or increasing that to 100,000, I get the same result:

Memory usage before: 0.45564270019531
Memory peak usage before: 0.62090301513672
Memory usage after load: 1.5339202880859
Memory usage after: 1.7933044433594
Memory peak usage after: 13.119705200195

I do get a peak of 13MB, but it always seems to go down to 1.79MB.

That's not to say there's definitely not an issue, just that I haven't been able to identify one. If you're able to shed more light or find something else, please do reopen/let me know. My current findings are that there doesn't appear to be a leak, things seem to be destroyed correctly.
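One way to check for a trend rather than a single before/after pair is to sample memory_get_usage(false) at intervals during the loop. The following is a rough sketch, assuming a hypothetical `sampleMemory()` helper that is not part of the library; the two closures are stand-ins for real work, and in the case discussed here the callable would wrap the `Message::from($email, false)` call from the script above.

```php
<?php
/**
 * Run $work repeatedly and record memory in use at regular intervals.
 * Hypothetical helper for illustration; not part of mail-mime-parser.
 */
function sampleMemory(callable $work, int $iterations, int $sampleEvery): array
{
    $samples = [];
    for ($i = 1; $i <= $iterations; $i++) {
        $work();
        if ($i % $sampleEvery === 0) {
            gc_collect_cycles();                  // free collectable cycles before measuring
            $samples[] = memory_get_usage(false); // memory currently in use
        }
    }
    return $samples;
}

// Stand-in workload that releases everything it allocates.
$steady = function (): void {
    $tmp = str_repeat('x', 100000);
    unset($tmp);
};

// Stand-in workload that deliberately keeps every allocation alive.
$kept = [];
$leaky = function () use (&$kept): void {
    $kept[] = str_repeat('x', 100000);
};

$steadySamples = sampleMemory($steady, 100, 10);
$leakySamples = sampleMemory($leaky, 100, 10);

// Growth between the first and the last sample, in bytes.
$steadyGrowth = end($steadySamples) - $steadySamples[0];
$leakyGrowth = end($leakySamples) - $leakySamples[0];

printf("steady growth: %d bytes\n", $steadyGrowth);
printf("leaky growth:  %d bytes\n", $leakyGrowth);
```

A workload that frees its memory should show samples that level off, while one that retains references keeps climbing with the iteration count.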

filipagh commented 1 year ago

@zbateson we have the same issue, but when processing a single mail with large content:

| mail size (MB) | php mem before mail parse (MB) | php mem after parse (MB) | php max mem (MB) |
|---|---|---|---|
| 6 | 28 | 46 | 59 |
| 12 | 28 | 65 | 78 |
| 18 | 28 | 85 | 102 |
| 24 | 28 | 104 | 126 |
| 30 | 28 | 124 | 150 |

Our max memory target is 128MB, so we can't parse mails of 25+MB.
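As a rough check of that limit, the peak figures from the table can be extrapolated linearly. This sketch assumes the memory columns are in MB and that peak memory grows linearly with mail size; neither is stated explicitly in the thread.

```php
<?php
// Figures copied from the table above: mail size (MB) => peak PHP memory (assumed MB).
$peakBySize = [6 => 59, 12 => 78, 18 => 102, 24 => 126, 30 => 150];

$sizes = array_keys($peakBySize);
$first = $sizes[0];
$last = $sizes[count($sizes) - 1];

// Average extra peak memory per additional MB of mail (two-point slope).
$slope = ($peakBySize[$last] - $peakBySize[$first]) / ($last - $first);

// Largest mail size whose estimated peak stays within the limit,
// under the linear-growth assumption.
$limit = 128;
$maxSize = $first + ($limit - $peakBySize[$first]) / $slope;

printf("%.2f MB of peak memory per MB of mail\n", $slope);
printf("estimated largest parseable mail: %.1f MB\n", $maxSize);
```

This gives about 3.79 MB of peak memory per MB of mail and an estimated ceiling of about 24.2 MB, which is consistent with the 25+MB figure quoted above.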

So far I see 2 problems:

1. \ZBateson\MailMimeParser\Message\MultiPart::getChildParts and \ZBateson\MailMimeParser\Message\MultiPart::getChildCount

Currently getChildCount uses getChildParts, and getChildParts uses iterator_to_array (with a RecursiveIterator), which causes a memory peak.

With big content, our own code should use getChildIterator rather than getChildParts.

But I think getChildCount in the lib should also use an iterator. What do you think?

2. Memory leak in getChildParts: parsing a 30MB mail makes ~100MB of memory allocation (for each 6MB of mail content -> ~20MB memory allocated). After a GC call, memory is only 23MB, independent of mail size.
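The difference between counting through an array and counting through an iterator can be shown with plain SPL classes. This is only a sketch: the nested array and the `leaves()` helper are hypothetical stand-ins for a multipart tree, and it does not confirm how the library implements getChildCount or getChildParts.

```php
<?php
// Hypothetical nested structure standing in for a multipart message tree.
$tree = [
    'text part',
    ['html part', ['inline image']],
    'attachment',
];

// Build a fresh recursive iterator over the leaves of the tree.
function leaves(array $tree): RecursiveIteratorIterator
{
    return new RecursiveIteratorIterator(new RecursiveArrayIterator($tree));
}

// Copies every element into an intermediate array before counting.
$countViaArray = count(iterator_to_array(leaves($tree), false));

// Walks the iterator and counts without building an intermediate array.
$countViaIterator = iterator_count(leaves($tree));

echo $countViaArray, ' ', $countViaIterator, PHP_EOL; // prints "4 4"
```

Both approaches return the same count; the iterator-based one simply avoids holding all elements in an array at once.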

here is link to tested mails

zbateson commented 1 year ago

Hi @filipagh --

Using your largest email, I'm not getting the same results. I reused the code I have above, like so (tested on PHP 7.4 and 8.2):

use ZBateson\MailMimeParser\Message;
use GuzzleHttp\Psr7;

echo "Memory usage before: " . (memory_get_usage(false) / 1024 / 1024) . PHP_EOL;
echo "Memory peak usage before: " . (memory_get_peak_usage(false) / 1024 / 1024) . PHP_EOL;

$msg = Message::from(Psr7\Utils::streamFor(fopen('test-30.eml', 'r')), false);
$msg->getChildParts();
echo "ChildCount: ", $msg->getChildCount(), "\n";

echo "Memory usage after load: " . (memory_get_usage(false) / 1024 / 1024) . PHP_EOL;
echo "Memory peak usage after: " . (memory_get_peak_usage(false) / 1024 / 1024) . PHP_EOL;

Example result I get:

Memory usage before: 0.51663970947266
Memory peak usage before: 0.63335418701172
ChildCount: 2
Memory usage after load: 1.9416275024414
Memory peak usage after: 2.0166473388672

Could you:

  1. Post a very simplified test case (perhaps based on mine) showing the results you're getting?
  2. Create a new ticket for this if it's an issue.

All the best

filipagh commented 1 year ago

I found out that our stream is not seekable (DBstream), so CachingStream is used, and I think the data is then stored in memory.
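If a non-seekable source is the cause, one possible workaround is to copy it into a php://temp stream first, which PHP keeps in memory only up to a threshold and then spills to a temporary file. The sketch below assumes a hypothetical `toSeekableTemp()` helper and uses a php://memory stream as a stand-in for the DB-backed source; whether this actually reduces memory use in that setup is untested here.

```php
<?php
/**
 * Copy any readable stream into php://temp so the result is seekable.
 * Hypothetical helper; not part of mail-mime-parser or Guzzle.
 * Data beyond $maxMemory bytes is written to a temporary file by PHP.
 *
 * @param resource $source
 * @return resource
 */
function toSeekableTemp($source, int $maxMemory = 2097152)
{
    $temp = fopen('php://temp/maxmemory:' . $maxMemory, 'w+b');
    stream_copy_to_stream($source, $temp); // copies in chunks, not all at once
    rewind($temp);
    return $temp;
}

// Stand-in source for the demo; in the case above this would be the DB stream.
$source = fopen('php://memory', 'w+b');
fwrite($source, "Subject: test\r\n\r\nbody");
rewind($source);

$seekable = toSeekableTemp($source);
$meta = stream_get_meta_data($seekable);
var_dump($meta['seekable']);

// $seekable could then be handed to the parser in place of the original stream.
```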