xpdAcq / mission-control

Releases, Installers, Specs, and more!

_end_beamtime issue #135

Open chiahaoliu opened 6 years ago

chiahaoliu commented 6 years ago

Reported by @jmbai2000 at the beamline; converting the discussion in email threads into an issue.

The _end_beamtime process takes a long time when a user's beamtime produces a large number of tiffs. The root of the problem is that we need to copy the local archive to the remote drive (GPFS-mounted), and the current bottleneck is the network speed.

Note: Tested in early May: it took ~10 min for a beamtime with ~100 tiffs to finish. Even if everything scales linearly (the best case), a beamtime with thousands of tiffs could take hours.
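A back-of-the-envelope check of that scaling (the per-tiff rate is derived from the single early-May data point, and linear scaling is an assumption, not a measurement):

```python
# Rough linear-scaling estimate for the _end_beamtime transfer time.
# Assumes the early-May data point: ~100 tiffs took ~10 minutes.
MINUTES_PER_TIFF = 10 / 100  # 0.1 min per tiff (best case, linear)

def estimated_transfer_minutes(n_tiffs: int) -> float:
    """Estimate transfer time in minutes, assuming linear scaling."""
    return n_tiffs * MINUTES_PER_TIFF

print(estimated_transfer_minutes(100))   # the measured case: ~10 minutes
print(estimated_transfer_minutes(3000))  # thousands of tiffs: ~300 minutes, i.e. hours
```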

Expected Behavior

_end_beamtime finishes within a reasonable time regardless of the amount of data collected.

Current Behavior

_end_beamtime may take hours when a user collects a sizeable amount of data.

Context

xpdAcq expects an empty xpdUser directory to start a new beamtime, i.e., the archive must be completely moved to the remote location.

chiahaoliu commented 6 years ago

maybe related to #108

attn: @dooryhee @sghose

CJ-Wright commented 6 years ago

One potential solution is to not archive the tiff_base data. This will reduce the memory footprint and increase the transfer speed. We should be able to reproduce the data via the re-running of the analysis pipeline.

sbillinge commented 6 years ago

I see two solutions.

1) I like CJ's suggestion. In fact, since we now have a save_last_tiff pipeline, it would be relatively easy to rerun the user experiment and save out all the tiffs on request. It would be much, much harder to keep track of which tiffs they actually saved during the experiment and save out only those on a replay request. We could also build a little front end that allows some filtering before running, let's call it, replay_last_tiff, so that not all tiffs are saved. For pipelines to work again post facto we may have to save all the darks as well, which could be a large file. In this scenario, users need to be very clear that their local tiffs will not be archived, so they must save them during the experiment if they want to walk away with them.

2) We have two places where we site .../xpd_users. The use case (UC) looks like:

  1. start_beamtime creates a clean .../xpd_users in location 1, so .../1/xpd_users
  2. users do their experiments
  3. BLS runs _end_beamtime which archives data as normal
  4. BLS runs start_beamtime
  5. start_beamtime either detects that the previous experiment was done in .../1/xpd_users, or it detects that .../1/xpd_users is not empty and looks to see if .../2/xpd_users is empty, then it builds a clean env in .../2/xpd_users
  6. New users inherit a pristine env in .../2/xpd_users even as end_beamtime is still archiving user 1 data.
  7. On completion, end_beamtime obliterates .../1/xpd_users as usual leaving it ready for the next start.
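The "find the clean slot" check in step 5 could be sketched roughly like this (the base path, the two-slot layout, and the function name are all hypothetical, not the actual beamline configuration):

```python
# Hypothetical sketch of step 5: pick the first empty (or nonexistent)
# .../<slot>/xpd_users so start_beamtime can build a clean env there
# while _end_beamtime is still archiving the other slot's data.
import os

def next_clean_slot(base, slots=("1", "2")):
    """Return the first <base>/<slot>/xpd_users that is empty or missing."""
    for slot in slots:
        path = os.path.join(base, slot, "xpd_users")
        if not os.path.exists(path) or not os.listdir(path):
            return path
    raise RuntimeError("both slots busy: previous archive has not finished")
```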

There may also be a hybrid solution. I like (2) because it is not UI-breaking: from the user/BLS perspective the workflow is the same. I like (1) because it takes us in the direction we want to go, actually starting to use the databases as they should be used.

My suggestion (discussion please?): 1) we try to implement (2) ASAP and roll it out (I am sure there are unforeseen problems); 2) we put (1) on a future release milestone, but initially offer it as an option, which users may like, of a tiff-free archive or a tiff-selected archive that will fit on their external hard drive.


chiahaoliu commented 6 years ago

xpdAcq and xpdAn look at directories configured by a yaml file at each beamline, so it may be better to keep the xpdUser directory at the same location so that we don't need to poke the configuration too often. I would therefore like to propose a variant of UC (2), as follows:

  1. User runs _end_beamtime
  2. Program archives the contents of the current xpdUser directory
  3. Program creates a tarball inside the current xpdUser directory
  4. Program renames the current xpdUser directory to a backup name
  5. Program moves the renamed directory under xpdConfig/
  6. Program creates a fresh xpdUser directory
  7. Program transfers the archive to the remote location
  8. BLS can start a new beamtime anytime while the transfer runs
  9. Program removes the backup directory inside xpdConfig once the archive has been transferred to the remote location
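A rough sketch of steps 2 through 6 (the function, path arguments, and archive name are illustrative placeholders, not the real _end_beamtime code):

```python
# Hypothetical sketch of the proposed flow: tar the current xpdUser,
# stash the whole directory under xpdConfig, and recreate a fresh
# xpdUser so start_beamtime can run while the slow remote transfer
# (step 7) and cleanup (step 9) happen out of band.
import os
import shutil
import tarfile

def end_beamtime_local(xpd_user, xpd_config, archive_name="xpdUser_backup"):
    """Steps 2-6: archive in place, park the backup, recreate xpdUser."""
    tar_name = archive_name + ".tar"
    with tarfile.open(os.path.join(xpd_user, tar_name), "w") as tar:
        for entry in os.listdir(xpd_user):       # steps 2-3: archive in place
            if entry != tar_name:                # don't add the tarball to itself
                tar.add(os.path.join(xpd_user, entry), arcname=entry)
    os.makedirs(xpd_config, exist_ok=True)
    backup = os.path.join(xpd_config, archive_name)
    shutil.move(xpd_user, backup)                # steps 4-5: rename and move
    os.makedirs(xpd_user)                        # step 6: fresh, empty xpdUser
    return backup
```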

Does it make sense?

CJ-Wright commented 6 years ago

@xpdAcq/bls thoughts?

adelezh commented 6 years ago

Correct me if I misunderstand: so the archiving will happen in the xpdUser directory? After archiving, the program will rename xpdUser and move it to xpdConfig, then transfer the archived data to the remote location, and while it is transferring we can start_beamtime.

What if the procedure were:

  1. User runs _end_beamtime
  2. Program renames the current xpdUser directory to a backup name
  3. Program creates a fresh xpdUser directory
  4. Program moves the renamed directory under xpdConfig/
  5. Program archives the contents of the xpdUser_sth_sth directory and transfers it to the remote location
  6. BLS can start a new beamtime anytime while archiving and transferring data
  7. Program removes the backup directory inside xpdConfig once the archive has been transferred to the remote location
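A sketch of the reordered flow (again with hypothetical names); the point is that the rename in step 2 is near-instant, so the fresh xpdUser exists before any slow tarring or transferring starts:

```python
# Hypothetical sketch: rename first, archive later.  The rename/move is
# a fast filesystem operation, so the beamline is free immediately;
# tarring and transfer then run on the parked copy in the background.
import os
import shutil
import tarfile

def end_beamtime_fast(xpd_user, xpd_config, backup_name="xpdUser_sth_sth"):
    """Steps 2-4: park the old xpdUser under xpdConfig, recreate it fresh."""
    os.makedirs(xpd_config, exist_ok=True)
    backup = os.path.join(xpd_config, backup_name)
    shutil.move(xpd_user, backup)   # step 2: near-instant rename/move
    os.makedirs(xpd_user)           # step 3: fresh xpdUser, beamtime can start
    return backup

def archive_backup(backup):
    """Step 5 (background): tar the parked copy for the remote transfer."""
    tarball = backup + ".tar"
    with tarfile.open(tarball, "w") as tar:
        tar.add(backup, arcname=os.path.basename(backup))
    return tarball
```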

Do you think this way would be faster, or would it not make a big difference?

dooryhee commented 6 years ago

Is this the final plan (sounds ok to me)? Is it part of the post-school new release?

CJ-Wright commented 6 years ago

Not to my knowledge but I'll check.

chiahaoliu commented 6 years ago

@adelezh yes, your understanding is correct and thanks for the suggestion, I really like it. Even though the archiving process is reasonably fast, we should avoid downtime as much as we can. Based on the discussion here, _end_beamtime refactoring will be implemented with the logic described by @adelezh.

DanOlds commented 5 years ago

Note on the current implementation of end_beamtime:

The new functionality is great, essentially instant. Two big flaws:

1.) Please, please, PLEASE take out the question at the end asking whether to delete the archive it just made. Users (and I) have lost notes and analysis stored in that folder because someone hit 'y' at the end of the archive procedure, not realizing this deletes the archive. Unless the archive is also kept somewhere else that isn't deleted? It would be great to find out that it is, but I don't think it is.

2.) It appears that bsui needs to be restarted after running end_beamtime, or it does not correctly 'forget' the previous sample names associated with certain sample numbers. That is to say, if you had samples in bt.list such as:

  1. setup
  2. Ni
  3. sample_A
  4. sample_B

and then run _end_beamtime, start a new beamtime, and import a new sample list (without exiting bsui), you can type bt.list and see:

  1. setup
  2. Ni
  3. sample_C
  4. sample_D

and yet, running xrun(3,0) would produce a file associated with sample_A.

Restarting bsui seems to remedy the issue, but I figured this counts as a bug.

sbillinge commented 5 years ago

I think that makes sense.



CJ-Wright commented 5 years ago

@DanOlds is the terminal giving any errors during the end beamtime process?

CJ-Wright commented 5 years ago

@DanOlds can you please open a separate issue for 2?

chiahaoliu commented 5 years ago

@DanOlds thanks for reporting this issue. We are working on the first issue.

Regarding the second issue, I am wondering whether you checked that the metadata actually goes into the db. Does xrun(3, 0) give a header with sample_name = sample_C?

One possible scenario I can think of is that xrun is still linked with the bt from the previous beamtime, and therefore with the old sample indices. We would need to do xrun.beamtime = bt again to update the reference (bt is not a singleton).
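A toy illustration of that scenario (plain Python classes, not actual xpdAcq code): rebinding the name bt in the shell does not change the object xrun already holds, so old sample indices keep resolving until the reference is updated.

```python
# Toy model of the stale-reference behavior described above.
class Beamtime:
    def __init__(self, samples):
        self.samples = samples  # index -> sample name

class XRun:
    def __init__(self, beamtime):
        self.beamtime = beamtime  # holds a reference to a Beamtime object
    def sample_name(self, index):
        return self.beamtime.samples[index]

bt = Beamtime(["setup", "Ni", "sample_A", "sample_B"])
xrun = XRun(bt)

bt = Beamtime(["setup", "Ni", "sample_C", "sample_D"])  # new beamtime list
print(xrun.sample_name(2))  # still "sample_A": xrun holds the old object

xrun.beamtime = bt          # the fix: re-associate xrun with the new bt
print(xrun.sample_name(2))  # now "sample_C"
```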

DanOlds commented 5 years ago

@chiahaoliu that could very well be the case. I'm happy to test that next time there is a break in the schedule at the beamline, because I can't remember the exact sequence of events that led to this behavior on Monday. If that is the case, it might be advisable for bsui to 'forget' xrun at the end of _end_beamtime, to force the next user to re-associate xrun with the bt.

adelezh commented 5 years ago

Yes, we need to do xrun.beamtime = bt after we start a new beamtime, otherwise xrun is still linked to the old bt. It has happened before at the D hutch.

Hui


chiahaoliu commented 5 years ago

@DanOlds I just examined the code again, and I found that answering y at the end of the archiving process should lead to no action in the code (ref).

Could you elaborate a bit more on this issue? Does the remote archive go to an unexpected place, different from the path shown by the _end_beamtime function?

DanOlds commented 5 years ago

@chiahaoliu the current behavior, to my understanding, is that the user_data directory is moved/renamed to one called 'user_data_PIname_SAF_datetime'. A new, empty 'user_data' directory is then created. The 'user_data_PIname_SAF_datetime' directory is then effectively an archive of the user's entire working directory. There is no reason to delete it automatically via the y/n option at the end of _end_beamtime. It seems reasonable to me that we could simply delete that archive manually at a later time (thus assuring time for the contained notes/analysis to be retrieved or transferred).

If you look in the /nsls2/xf28id1/xpdacq_data folder currently, you'll see a number of these directories, as well as the active 'user_data' directory.