wettenhj / mytardis

MyTardis - a data management system for private lab/facility data
http://mytardis.github.com
GNU General Public License v3.0

MyData uses a new Uploader model in MyTardis. Discuss pros/cons of including in core MyTardis #4

Closed wettenhj closed 9 years ago

wettenhj commented 9 years ago

Use cases for the new Uploader model added to (this fork of) MyTardis are discussed below.

Fields of the Uploader model are highlighted in bold below, because there has been some discussion of whether we need so many fields.

  1. When a MyTardis administrator receives a request for staging access (automatically generated by MyData), they can look up the uploader record associated with the request and check the linked instrument record (which gives them the Facility record via a foreign key), as well as the contact name and contact email for the instrument PC stored in the uploader record, so they know whom to notify once the upload-to-staging access has been set up.
  2. User-facing instrument PCs are difficult to identify uniquely in a reliable way (users can often change IP addresses, hostnames, etc.). The best we can do is use the MAC address of the network interface (e.g. Ethernet) as a unique identifier for the "uploader" record; a sketch of reading the MAC address follows this list. It's important that we don't accidentally grant staging access to the wrong instrument PC (or other upload PC).
  3. It is envisaged that Facility records will be created by the MyTardis administrator and will not be modifiable by MyData; however, Instrument records need to be modifiable/creatable by MyData, because the purpose of MyData is to make it easy to add a new instrument PC to MyTardis. If a MyData user tries to assign an instrument name which is not specific enough and is already used elsewhere in their Facility, e.g. "Nikon Microscope", then MyData should be able to give the user some indication of which instrument PC has already used the duplicate instrument name. In this case, MyData could ask MyTardis to report the hostname (e.g. nikontraining.mmi.monash.edu.au), the OS name (e.g. "Windows"), and maybe the OS username, e.g. "nikontraining", which can help the facility manager using MyData to determine which instrument PC has the duplicate instrument name. Custom authorization is required in TastyPie, because generally a user should only be able to access their own Uploader record (whose MAC address matches theirs), but in this case, MyData users (facility managers) need to be able to access a few fields (hostname, os_name, os_username) from another uploader record with the same instrument_name.
  4. Uploader records contain various fields which can be used by MyTardis/MyData/Store.Star support staff to diagnose problems with MyData installations without having to visit the instrument PC. These fields include the User Agent Name being used to upload to MyTardis, e.g. "MyData", and the User Agent Version, e.g. "0.0.3" (maybe add a git commit hash), the User Agent Install Location, e.g. "C:\Program Files (x86)\MyData", the architecture (os_platform) which MyData was built with (e.g. MyData.exe could be built with a 32-bit Python), the architecture of the instrument PC (machine), the memory capacity of the PC, the OS version, the number of CPUs, and the disk usage and capacity. The value of these fields is known from previous experience with supporting other wxPython GUIs like MASSIVE Launcher/Strudel. Often users don't give clear answers to these questions, and it can be difficult to visit instrument PCs physically to diagnose problems when they are spread across multiple sites (Clayton, AMREP etc.).
  5. I'm expecting fierce debate on this issue, but I believe that the TastyPie custom authorization for the Uploader model in this git branch does the right thing in allowing an anonymous user to create an Uploader record without authenticating to MyTardis first. Authentication problems with GUIs are extremely common - I.T. help desks are often asking users "Are you sure you typed your password correctly?" So if MyData waits until it has successfully authenticated to MyTardis before uploading diagnostic information for MyTardis administrators/support staff, then we could end up in a situation where the MyData user thinks that they have installed MyData and submitted a request for RSYNC upload access, but the MyTardis administrator might not see any request, due to a bug in MyData or due to the user's inability to enter their password or API key. The MyData installation wizard should ask the installing user (facility manager) to agree to the terms and conditions of using MyData, which are that once a valid MyTardis URL (which supports the new Uploader model) has been entered in MyData, diagnostic information about the PC will be sent to that MyTardis URL. (A sketch of such an authorization policy follows this list.)
  6. The created time and updated time fields of the uploader record show the date when MyData was first run successfully on an instrument PC, and the last time it was run (on the same network interface). This could be useful for help desk staff to diagnose problems, e.g. if a user says "Our MyTardis uploads from this PC are broken", it is useful to the help desk staff / MyTardis administrator to determine how long it has been broken for, and how big the data backlog is which needs to be uploaded.
  7. The ipv4 address, ipv6 address, subnet mask and wan_ip_address fields are used for granting access to staging areas. Sometimes the administrator of the staging host will need to add the instrument PC's IP address to a hosts.allow file, or to an iptables firewall, or (in the case of NeCTAR/OpenStack) to a security group. Some instrument PCs have a public-facing IP address, so the IP address you get from "ipconfig" or "ifconfig" will be the same as the one you see when you navigate to http://www.whatismyip.com/, whereas other instrument PCs will be stuck behind firewalls / routers / gateways, so their internal and public IP addresses could be completely different. The wan_ip_address is determined on the server side (in TastyPie), using the django-ipware module, which uses the HTTP_X_FORWARDED_FOR header; a sketch of this server-side lookup follows this list. The wan_ip_address is probably by far the most useful IP address to store in the uploader record (for granting access to the staging host through a firewall), although this IP address might not be unique to an instrument PC, as multiple instrument PCs can connect to the Internet through a common gateway. There may be use cases (e.g. an I.T. help desk diagnosing MyData problems) where having an internal IP address is useful too, e.g. the I.T. administrator might be able to connect to the instrument PC via Remote Desktop Protocol or using a UNC path to access its filesystem via CIFS. Note that the hostname field in the uploader record is basically just the result of running "hostname" on the client machine - MyData makes no effort to contact a DNS server to determine a fully-qualified hostname from the PC's IP address. So there's no guarantee that the hostname can be used by an I.T. help desk to remotely connect to the PC, but the hostname might help a facility manager to identify another PC in their facility. For example, if a MyData user (facility manager) tries to enter a duplicate instrument name, MyData might say "The instrument name 'Test Microscope 1' for facility 'Test Facility' is already being used on hostname 'JamesLaptop'."
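To illustrate point 2, here is a minimal sketch (not MyData's actual code, which is linked in a later comment) of how a client can read a MAC address to use as its uploader identifier:

```python
import uuid

def get_mac_address():
    """Return a MAC address for this machine as a colon-separated
    hex string, e.g. "00:1a:2b:3c:4d:5e".

    Note: uuid.getnode() returns a 48-bit integer; if it cannot find
    a hardware address it falls back to a random number with the
    multicast bit set, so a real client should sanity-check the
    result before treating it as a stable unique identifier.
    """
    node = uuid.getnode()
    return ":".join("%02x" % ((node >> shift) & 0xff)
                    for shift in range(40, -1, -8))
```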
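For points 3 and 5, a minimal sketch of a TastyPie authorization policy along the lines described: anonymous creation is allowed, while reads are restricted to the requester's own record. The query-parameter check and field names here are illustrative, not the branch's actual code:

```python
from tastypie.authorization import Authorization
from tastypie.exceptions import Unauthorized

class UploaderAuthorization(Authorization):
    def create_detail(self, object_list, bundle):
        # Allow anonymous creation, so diagnostic info reaches the
        # server even when the user cannot authenticate (point 5).
        return True

    def read_list(self, object_list, bundle):
        # Restrict reads to the record whose MAC address matches the
        # one the requester claims as its own (point 3 relaxes this
        # for a few fields when instrument names clash).
        mac = bundle.request.GET.get('mac_address')
        if mac is None:
            raise Unauthorized("Filter by your own MAC address.")
        return object_list.filter(mac_address=mac)
```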
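And for point 7, a sketch of the server-side WAN IP determination using django-ipware's get_ip helper (API as of the 1.x releases) inside a TastyPie resource; the resource is abridged, and the import path for the Uploader model assumes the file linked in a later comment:

```python
from ipware.ip import get_ip  # django-ipware's best-match helper
from tastypie.resources import ModelResource

from tardis.tardis_portal.models.uploader import Uploader

class UploaderResource(ModelResource):
    class Meta:
        queryset = Uploader.objects.all()
        resource_name = 'uploader'

    def hydrate_wan_ip_address(self, bundle):
        # Fill in the WAN IP server-side rather than trusting the
        # client; get_ip() consults HTTP_X_FORWARDED_FOR before
        # falling back to REMOTE_ADDR.
        bundle.data['wan_ip_address'] = get_ip(bundle.request)
        return bundle
```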
wettenhj commented 9 years ago

To see how MyData determines the data to populate the Uploader record with, see: https://github.com/monash-merc/mydata/blob/master/UploaderModel.py#L238

wettenhj commented 9 years ago

Here's the code for the new Uploader model implemented in my "mydata" branch of MyTardis: https://github.com/wettenhj/mytardis/blob/mydata/tardis/tardis_portal/models/uploader.py
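For readers who don't want to follow the link, here is an abridged sketch of the model, reconstructed from the fields discussed in this thread; field names and lengths are approximations, so see the linked uploader.py for the real definition:

```python
from django.db import models

class Uploader(models.Model):
    mac_address = models.CharField(max_length=64, unique=True)
    instrument = models.ForeignKey('Instrument', null=True, blank=True)
    contact_name = models.CharField(max_length=64)
    contact_email = models.CharField(max_length=64)
    # Diagnostic fields (point 4 above)
    user_agent_name = models.CharField(max_length=64)      # e.g. "MyData"
    user_agent_version = models.CharField(max_length=32)   # e.g. "0.0.3"
    user_agent_install_location = models.CharField(max_length=256)
    os_platform = models.CharField(max_length=64)
    os_version = models.CharField(max_length=64)
    hostname = models.CharField(max_length=64)
    # Networking fields used for granting staging access (point 7)
    ipv4_address = models.CharField(max_length=16)
    ipv6_address = models.CharField(max_length=64)
    subnet_mask = models.CharField(max_length=16)
    wan_ip_address = models.CharField(max_length=64)
    created_time = models.DateTimeField(auto_now_add=True)
    updated_time = models.DateTimeField(auto_now=True)
```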

grischa commented 9 years ago

James, first off, let me suggest for the future that, before writing code for new database models or other core changes, you talk with someone about your design. Steve and I always had several discussions whenever we wanted to make core changes like this. These discussions were really helpful and avoided a lot of unnecessary work.

Regarding the Uploader Model, I believe it is completely unnecessary at this stage. However, rather than discussing each of your above points with you, I would like to talk about this as if no code had been written yet, because this is when we should have had this discussion in the first place.

Assumed situation:

Envisaged workflow:

If there are issues, the system and network information you mentioned is put in an email or an email template that can be sent manually or automatically to the administrator. For that, there could be a send_support_request API hook, because clients might have trouble sending emails due to firewall etc.
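A minimal sketch of what such a send_support_request hook could look like as a Django view; this endpoint does not exist yet, and the name and payload format are hypothetical:

```python
import json

from django.core.mail import mail_managers
from django.http import HttpResponse, HttpResponseBadRequest
from django.views.decorators.csrf import csrf_exempt
from django.views.decorators.http import require_POST

@csrf_exempt  # the client may not be able to authenticate at all
@require_POST
def send_support_request(request):
    # The client POSTs its diagnostic text; the server relays it to
    # the administrators, so the client never needs SMTP access.
    try:
        report = json.loads(request.body)
    except ValueError:
        return HttpResponseBadRequest("Expected a JSON body")
    mail_managers(
        subject="Support request from %s" % report.get("hostname", "unknown"),
        message=report.get("message", ""))
    return HttpResponse(status=201)
```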

I might have overlooked something important here, of course. You have more experience with instruments than me, and I am open to a discussion of all this, which is why I suggested we meet on Monday.

steveandroulakis commented 9 years ago

My comments are based on the following personal opinions of mine:

  1. It's still very early days with this instrument integration work, and much may change in the future in how both the technical aspects and the 'business' of integration is carried out.
  2. The 'business' that surrounds a new instrument integration in a new or existing facility is going to be a messy and drawn-out process requiring meetings, back-and-forth interaction between experts on both sides and a bit of 'hand holding'.
  3. Facilities may have to make minor changes or concessions in their data management workflow to be able to participate with us.
  4. Facility managers will need to have a close-ish relationship with MyTardis/MyData folk, at least in the beginning of instrument integration. This builds trust on both sides and helps ensure a smooth transition.
  5. When things go wrong with Strudel, the user submits a debug report that goes to the team via email or ticket or both. The same could happen with MyData, and not require any information to be stored or disseminated by MyTardis in the process.
  6. I'm assuming that all of these assumptions are reasonable, because as nice as MyData and MyTardis can be, it's not like putting toast into a toaster -- you can't just integrate instruments like you can buy pre-packaged sliced bread ready to go and put it into your 0-setup store-bought toaster. I'd like it to be like this, but I think it's premature to assume this is how it will go based on experience with 1 facility.

Given the diversity and complexity of instrument workflows and setups, I feel the business of setting up new instruments doesn't currently require comparatively heavy and rigid infrastructure / workflows. I believe it risks facility X or instrument Y or user Z showing up and breaking the model, and hence all the code and effort put in.

For example, to address your points:

  1. A button in MyData could say 'set up a new instrument', ask the user for their email address, a meaningful instrument name and description, and a Facility (that comes from a target MyTardis DB); or, if it's a new facility, perhaps that's a conversation significant enough to warrant discussion before going ahead with a MyTardis admin creating a facility entry. This info is then registered with MyTardis along with a MAC address, and a key for access to MyTardis is retrieved. Then an email goes to a MyTardis sysadmin for approval of the instrument.
  2. I'm cool with many MAC addresses being associated with an instrument (covering for various NICs) as a security mechanism. This should indeed complement some kind of key/password combo. Keys and passwords aren't completely secure on a shared system like this, nor is a MAC address, but the combination seems okay to me to deter any opportunists looking to masquerade as an instrument and upload tonnes of movies from The Pirate Bay to Store.Sync.
  3. The issue of unique instrument names in a facility is a business problem to me. Enforce a unique name for a facility's instrument, one that makes sense to the humans who work there. If a facility has 2 Nikon Microscopes of the same model, ask them to find a way to distinguish them by name. It's the price to pay to play.
  4. Support can be covered by a Strudel-like email sent from MyData when things go wrong. No need for MyTardis to know or store anything about this.
  5. I'm okay with anonymous requests for new instruments connected to already-stored facilities. If there's an approval process then some conversation should be taking place anyway before everything goes ahead.
  6. A simple query in MyTardis (that could have an interface over it) can determine when an instrument first started storing data, and when it last saw any data stored; a query sketch follows this list. This doesn't cover the time a new instrument request was made, but the MyData app could send that info (see point 1 above).
  7. For me, networking for this kind of issue can have infinite problems based on infinite complexity of networking setups and firewalls. No amount of information stored by MyTardis nor any system logic can truly solve any issues here. The MyData 'new instrument' button can send some info about this over in a 'new instrument' request. Also, networking changes over time so storing it on the MyTardis side risks out of date information (and we're never going to sync that, that's going way too far!).
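On point 6, a minimal sketch of the kind of query meant there; the model and field names are illustrative rather than MyTardis's actual schema at the time:

```python
from django.db.models import Min, Max

def instrument_activity(instrument):
    """Return (first_upload, last_upload) timestamps for all
    datafiles stored via the given instrument."""
    # Illustrative import; the datafile model and its field names
    # differ across MyTardis versions.
    from tardis.tardis_portal.models import Dataset_File
    stats = Dataset_File.objects.filter(
        dataset__instrument=instrument,
    ).aggregate(first_upload=Min('created_time'),
                last_upload=Max('created_time'))
    return stats['first_upload'], stats['last_upload']
```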

I greatly admire that you're designing this as a system that eventually can exist on its own without human intervention. I mean, when was the last time anyone spoke to anyone in order to set up or use Dropbox? However, I truly believe this is several orders of magnitude more complex.

To quickly try and find a metaphor: what we're doing is more like architecting someone's house with some tools and know-how than handing someone an IKEA manual and expecting everything to come out okay.

I'm sure many are carrying the assumption that facilities across campus/universities/states can integrate a couple of hundred instruments or more over time in an automated fashion. As someone who has witnessed lots of eResearch software development and knows this project well, I think that's a lofty goal that would need another 5 or so developers to do properly, so we shouldn't set the bar there for now; instead we should concentrate on getting our existing instruments right and slowly connecting new ones over time.

wettenhj commented 9 years ago

Hi Grischa,

Thanks for your comments (I haven't read Steve's yet).

I'm happy to discuss on Monday your perspective that maybe we don't need an Uploader model at all. Regarding the following comment you made:

If there are issues, the system and network information you mentioned is put in an email or an email template that can be sent manually or automatically to the administrator. For that, there could be a send_support_request API hook, because clients might have trouble sending emails due to firewall etc.

The current design is that the MyData instance contacts the MyTardis server not just "if there are issues" but as a routine part of its initialization on its first run - it generates an SSH key-pair to use for RSYNC access to a staging host and then sends a request to the MyTardis server, including the public key. The networking information is not just used "if there are issues". The staging host (e.g. a Vera) may require the new instrument PC to be whitelisted in /etc/hosts.allow or in iptables rules, so the MyTardis administrator may need to figure out whether the instrument PC is on a static IP address or whether it has an IP range which can be deduced from its subnet mask. These are all issues I've had to deal with in the past which involved manually visiting microscope PCs multiple times (which was a pain and a waste of time), so I'm trying to eliminate that need by registering the instrument PC's properties in a central place.
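A rough sketch of that first-run step (not MyData's actual implementation, which is linked above; the endpoint and field names are illustrative):

```python
import os
import subprocess

import requests

def request_staging_access(mytardis_url):
    """Generate an SSH key-pair (if absent) and send the public key
    to the MyTardis server as part of the staging-access request."""
    key_path = os.path.expanduser("~/.ssh/MyData")
    if not os.path.exists(key_path):
        subprocess.check_call(
            ["ssh-keygen", "-t", "rsa", "-b", "2048",
             "-N", "",          # no passphrase; the key is app-managed
             "-C", "MyData key",
             "-f", key_path])
    with open(key_path + ".pub") as pub_file:
        public_key = pub_file.read().strip()
    response = requests.post(mytardis_url + "/api/v1/uploader/",
                             json={"ssh_public_key": public_key})
    response.raise_for_status()
```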

Maybe you think that trying to automate the request for RSYNC over SSH access is overkill, but I can guarantee that my users would prefer to have this done automatically than have to manually create SSH key-pairs etc.

Regarding putting the system and network information in an email, configuring a client-side GUI to send emails automatically (and manage credentials appropriately) is definitely not easy, unless all of your users are internal with a common SMTP server. I guess this is what you are alluding to by suggesting creating a send_support_request API hook, which sounds fine to me. I don't feel too strongly about whether the data POSTed from MyData ends up in a database model like "Uploader" - but I do feel strongly that MyData should POST information to a server and let the server deal with it, rather than trying to send notifications (and requests for support) directly from the client-side GUI.

I haven't implemented submitting debug logs from MyData yet, but I imagine I'll follow the same basic mechanism we used for the MASSIVE/CVL Launcher (Strudel), i.e. POST to a server (we currently use https://cvl.massive.org.au/cgi-bin/log_drop.py) and then have the server generate emails as needed. The big difference with the MASSIVE/CVL Launcher was that it made sense to host the submitted debug reports in a central place (cvl.massive.org.au) for all MASSIVE and CVL users, whereas for MyTardis, I think it makes more sense to send them to each individual MyTardis server, so that you can ensure that the debug reports go to the appropriate MANAGERS in settings.py.
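A sketch of the server-side half of that mechanism, assuming a 2014-era Celery task (hypothetical, not existing MyTardis code) that relays a POSTed debug report to the MANAGERS configured in settings.py:

```python
from celery import task
from django.core.mail import mail_managers

@task
def forward_debug_report(report_text, uploader_hostname):
    # mail_managers() emails everyone listed in settings.MANAGERS,
    # so debug reports reach the right people for each deployment.
    mail_managers(
        subject="MyData debug report from %s" % uploader_hostname,
        message=report_text)
```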

So I didn't start out by thinking "we need an Uploader model" - it was more a case of "I don't want to have to deal with the complexity of getting client-side applications to send emails automatically (and manage credentials appropriately)". POSTing to a server and getting the server to generate notifications makes things much easier. Of course the information doesn't necessarily have to be POSTed to a database model - it could just be dumped in a directory outside of the database, but using the database seemed nicer to me.

I found the process of adding new microscope PCs to the atom provider workflow very messy in terms of manually keeping track of SSH key-pairs, authorized_keys files etc. I would find it much easier to keep track of all of the instruments (which can eventually be done from Jason's Facility View) if we had all of the instrument PCs registered in a central place, with more detailed information than what you get by just manually dumping an SSH public key in an authorized keys file. I'm sure our facility manager stakeholders would find it useful to be able to see information in MyTardis's new Facility View from the "Uploader" model, including what version of MyData is installed on each instrument PC and when was the last time that MyData instance uploaded data to MyTardis etc.

We can discuss this further on Monday - hopefully that clarifies a few aspects of my motivation in going down this path.

wettenhj commented 9 years ago

Hi Steve,

Thanks for your comments! I have added some replies inline.

It's still very early days with this instrument integration work, and much may change in the future in how both the technical aspects and the 'business' of integration is carried out.

I agree 100% about the risk of change etc. I think my perspective is slightly different from yours and Grischa's, in that I think it's really important to try hard to get this working well for one facility first in a reasonable time frame, even if it involves making a few rushed design decisions which might need to be reworked later on when expanding to other facilities. I wouldn't care if models I added into MyTardis got merged in with a warning attached that they are considered unstable and may be deprecated soon, until we establish their widespread value. I started this recent discussion thinking that it would be a big win for the "usability" of MyData to be able to state in MyData's User Guide that MyData is compatible with an "official" version of MyTardis, but another part of me thinks we should just focus on getting things right for one facility first, and it doesn't matter if that facility is using the "mydata" fork of MyTardis for now. Having said that, I'm still very keen to discuss what changes you and Grischa would like to see in the "mydata" fork before you would consider merging it into the main MyTardis repository.

The 'business' that surrounds a new instrument integration in a new or existing facility is going to be a messy and drawn-out process requiring meetings, back-and-forth interaction between experts on both sides and a bit of 'hand holding'.

I think the priority is to get MyData working nicely for one facility initially, and for the facility we are targeting, we have already advanced through a lot of the drawn out meetings and 'hand holding'.

Facilities may have to make minor changes or concessions in their data management workflow to be able to participate with us.

Yes, I try to be a middle man, compromising between what is good for the facility I'm working with and what is good for MyTardis in general. I spent some significant time earlier this year ensuring that MyData doesn't require a MyTardis administrator account (to create MyTardis accounts from folder names), not because the facility I'm working with cares about that, but because the local MyTardis developer community cares about being able to merge multiple MyTardis's into one without stepping on each other's toes.

When things go wrong with Strudel, the user submits a debug report that goes to the team via email or ticket or both. The same could happen with MyData, and not require any information to be stored or disseminated by MyTardis in the process.

Strudel POSTs the debug report to a server (cvl.massive.org.au) which dumps it into a text file in a regular directory on disk, and then a CRON job generates emails. So it doesn't go in a database, but the information is still stored on a server, until an administrator chooses to delete it. For MyData, we could replace the directory on disk with a MyTardis storage box and replace the CRON job with a Celery task, or we could maintain a separate CGI script outside of MyTardis if you prefer, but I don't really understand why we would want to, given that MyTardis and TastyPie already have mechanisms to handle POSTed data.

I'm assuming that all of these assumptions are reasonable, because as nice as MyData and MyTardis can be, it's not like putting toast into a toaster -- you can't just integrate instruments like you can buy pre-packaged sliced bread ready to go and put it into your 0-setup store-bought toaster. I'd like it to be like this, but I think it's premature to assume this is how it will go based on experience with 1 facility.

I'm not assuming anything about what will happen beyond 1 facility at this stage. I'm just assuming that we need a system which appears user friendly, robust and reliable by the time we go into production. And sure, things will go wrong even with an application which aims to be user friendly - that's why you have to work so hard to do "clever" things to make users glad that they are using an automated program despite the inevitable bugs which could otherwise make the facility manager question whether they want to recommend the system to their researchers as an official facility policy.

A button in MyData could say 'set up a new instrument', ask the user for their email address, a meaningful instrument name and description, and a Facility (that comes from a target MyTardis DB); or, if it's a new facility, perhaps that's a conversation significant enough to warrant discussion before going ahead with a MyTardis admin creating a facility entry. This info is then registered with MyTardis along with a MAC address, and a key for access to MyTardis is retrieved. Then an email goes to a MyTardis sysadmin for approval of the instrument.

As discussed in my reply to Grischa's comment, I don't want to have to deal with the complexity (SMTP configuration etc.) of sending automated emails directly from a client application - it's much simpler to POST to a server first, and have the server take care of sending emails (e.g. using CRON or Celery). In many cases, I don't really care if the info submitted from MyData is POSTed into a database model or into a location outside of the database, but I can see advantages of using the locations managed by MyTardis for these requests/info submitted from MyData, even if the request/info records (managed by the database application) are only meaningful for a limited time, just as messages in the djkombu_message table (used by Celery) are only meaningful for a limited time.

I'm cool with many MAC addresses being associated with an instrument (covering for various NICs) as a security mechanism.

For the facility we are targeting (and probably for any other facility), I don't regard the method I'm currently proposing as being a "security mechanism" but more a deterrent for people who might consider trying to meddle with the facility's data store. If we wanted to turn it into a "security mechanism", MyData would not be storing a private key on each instrument PC. Instead, users would have to plug in a USB stick with their own personal private key every time they used MyData, but that would require extra development time to deal with USB stick insertion and removal events, and I don't think our facility stakeholders want to inconvenience their users in that way.

For me, networking for this kind of issue can have infinite problems based on infinite complexity of networking setups and firewalls. No amount of information stored by MyTardis nor any system logic can truly solve any issues here.

I disagree. I would agree if you had said "No amount of information ... can truly solve all issues here", but "any issues"? Really? I have logged this type of information before in previous database applications I have worked on, and found it to be useful. The main value is for creating firewall / iptables / hosts.allow rules for giving your instrument PC RSYNC over SSH access to a staging server, and determining whether the IP address as seen by the staging server's sshd changes within a defined IP range, or whether it is fixed. For example, if an instrument PC is rebooted and gets a new IP address, and suddenly it can't connect to the staging server, it could be because you (the MyTardis administrator) specified the PC's IP range incorrectly in the firewall rule. By inspecting the new IP address and subnet mask closely in the Django Admin interface (after the MyData instance updates its Uploader record), you can fix the problem in the firewall rule. This task (creating the firewall rule) is not intended to be automated by MyData or MyTardis - it is intended to be done manually by a MyTardis administrator after receiving a request from MyData.
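For concreteness, a sketch of the administrator's check described above, using Python's ipaddress module (the addresses are made up):

```python
import ipaddress

def ip_in_rule(reported_ip, firewall_rule_cidr):
    """True if the newly reported address is still covered by the
    range in the iptables / hosts.allow rule."""
    return (ipaddress.ip_address(reported_ip)
            in ipaddress.ip_network(firewall_rule_cidr))

def network_of(reported_ip, reported_netmask):
    """Derive the PC's network from the Uploader record's address
    and subnet mask; strict=False permits host bits in the address."""
    return ipaddress.ip_network(
        "%s/%s" % (reported_ip, reported_netmask), strict=False)

print(ip_in_rule("192.0.2.23", "192.0.2.0/24"))    # True
print(network_of("192.0.2.23", "255.255.255.0"))   # 192.0.2.0/24
```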

The MyData 'new instrument' button can send some info about this over in a 'new instrument' request. Also, networking changes over time so storing it on the MyTardis side risks out of date information (and we're never going to sync that, that's going way too far!).

Sorry, maybe this part wasn't clear. Every time MyData starts up it checks for an existing Uploader record matching its MAC address. If it doesn't find one, it creates one. But if it finds an existing record, it updates it (using PUT instead of POST), so if a rebooted instrument PC gets a different IP address, the MyTardis administrator will be able to see the latest IP address registered from the most recent time MyData was run on that PC, so then the MyTardis administrator can double-check whether the IP range they have specified in their iptables firewall or /etc/hosts.allow includes the new IP address.
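A condensed sketch of that startup logic (illustrative; it assumes the uploader endpoint can be filtered by MAC address):

```python
import requests

def register_uploader(mytardis_url, uploader_data):
    """Create the Uploader record on first run; update it (PUT) on
    every subsequent run, so the server always holds the most
    recently observed IP addresses etc."""
    base = mytardis_url + "/api/v1/uploader/"
    response = requests.get(base, params={
        "format": "json",
        "mac_address": uploader_data["mac_address"]})
    existing = response.json()["objects"]
    if existing:
        resource_uri = existing[0]["resource_uri"]
        requests.put(mytardis_url + resource_uri,
                     json=uploader_data).raise_for_status()
    else:
        requests.post(base, json=uploader_data).raise_for_status()
```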

I greatly admire that you're designing this as a system that eventually can exist on its own without human intervention.

I would say that "without human intervention" is an exaggeration - I'm certainly not aiming for that right now. But I would like to have a system (MyData) in place on each instrument PC which facility managers can trust to push data / meta-data / diagnostic error messages to a central location, so that facility managers can spend more time using MyTardis's Facility View to look at multiple instruments' data upload statistics on one screen, and less time trying to find a vacancy in the instrument booking calendar to physically visit an individual instrument PC to try to diagnose a problem directly on the PC in question.

I'm sure many are carrying the assumption that facilities across campus/universities/states can integrate a couple of hundred instruments or more over time in an automated fashion. ...

I'm not making any assumptions like that at this stage - I'm just doing my best to get one facility feeling like they can roll out a reasonably reliable, robust, user friendly, and transparent (to facility managers) research data management system (including a GUI upload manager) to their users. I completely agree about the lack of manpower - there will always be facilities who have expectations of integrating a new research data management system with an existing workflow, and existing workflows can vary immensely in complexity.

Cheers, James

steveandroulakis commented 9 years ago

Just quickly,

"I disagree. I would agree if you had said "No amount of information ... can truly solve all issues here", but "any issues"? Really?"

I meant "all" here not "any". Not because of a typo but because I meant any as in this can't solve "any issue encountered at random, occurring from the infinite amount of issues that could be encountered", not "can't solve any issue at all". Or perhaps replace "any" with "every".

So we agree on that part.

Steve


wettenhj commented 9 years ago

Ah, I understand now.

<Imagining a different intonation on the word "any" now.>
