ywyw / dat-wise

Importer for the WISE All-Sky Data Products

Figure out source data #1

Open ywyw opened 9 years ago

ywyw commented 9 years ago

Max and I have located the WISE catalog and the ATLAS images. The catalog is 400GB of ASCII data: http://irsadist.ipac.caltech.edu/wise-allwise/wget_gz.script. And we are not sure where to get the raw ATLAS FITS files. We decided to start with WISE but we want to build importers for 2MASS and SDSS as well in the future, with the goal of producing the same normalized data from all 3 sources eventually.

  1. What is the relationship between the WISE catalog and the ATLAS images as it relates to how we are going to feed data into the Trillian project?
  2. What metadata do we need to extract from the raw data for Trillian to be able to build a Pixel database? Is it part of our job to do the indexing/conversion of the catalog data (currently in ra/dec) into Pixels?
  3. Do we need both the images and the catalog or should we only look at one or the other?
  4. It was mentioned that we might start with a small section of the sky first, is this something that would make sense? If so, how should we proceed?

cc @maxogden @demitri

We can jump on a hangout sometime if that is easier.

demitri commented 9 years ago

Yeah, I’m having a problem finding the raw images as well (you can start to see the problems astronomers face). I’ve sent an email to a colleague that works with them; I’ll let you know ASAP. To answer your questions:

  1. What is the relationship between the WISE catalog and the ATLAS images as it relates to how we are going to feed data into the Trillian project?

When you start with a survey like this, the first result is a large number of image files. They look like what you'd expect - images of stars, galaxies, noise artifacts, etc. Software is then run on the images to identify individual stars and galaxies. For each of these, many measurements are made (brightness, area, location, etc.). These measurements become the catalog: one row for each object. You can imagine that if you change the algorithm that identifies/separates objects, the resulting number of rows in the catalog will change.

This brings up an interesting point that was raised recently. Let's say I want to perform an analysis of "all objects in region x in the sky". The WISE catalog will have a list of items from that region, but the SDSS catalog will have a different list. The latter has a higher resolution than the former, so two stars that are close together may appear as a single object in WISE but are clearly separated in SDSS. So if you want to loop over "all objects", where is that master list? The answer is that Trillian should create a "master catalog", which is the superset of all available catalogs. Cross tables would be built so that when one looks up object 12345 in Trillian, we'll know it's matched against object 9876 in SDSS and 6543 in WISE. This might be an implementation detail above the level you are working at, but it's worth mentioning.

The aim for Trillian is to have the catalog information in a database and the images available. How this is done is an implementation detail. I'm now going to coin the term "trixel" - a Trillian (HEALPix) pixel. (Just capitalizing "pixel" still might be confusing.) A trixel may consist of a PostgreSQL database with the catalog data, plus the imaging data for that piece of the sky. A remote node that contains 10 trixels might share the same database (no point in creating 10 PostgreSQL instances) and some amount of disk space for the images. But this is getting into implementation details, which I don't want to constrain you on. Hopefully this all makes sense.

What metadata do we need to extract from the raw data for Trillian to be able to build a Pixel database?

I would say almost nothing. How you extract metadata from the images really depends on what kind of analysis you are doing, so it's very hard to generalize, and it's not something I'd expect to put into the database. This is where the idea of a translator comes in between DAT and the raw images - a request to DAT for data from an image is passed to a translator which handles the actual work, then returns the value to DAT to pass back. The simplest (but still extremely useful) case I can think of: given an ra/dec position, "cut out" the pixels within a radius (e.g. 1 degree) and add up all of their values. Even this is not trivial; that region may cover one, two, three, or potentially more files. (There is existing code to do this which we can use - http://montage.ipac.caltech.edu/index.html .)

What would be nice to have DAT do is, given the ra/dec, look up in the database the names of the files needed; this can then be passed to a translator that calls Montage and returns the result. I would certainly not go beyond this now; just implementing this allows plenty of flexibility, since one would just write a new translator and define its inputs and outputs. Further, this information (e.g. the center ra/dec of the image, the size of the image in pixels, the size of the image in degrees) is all located in the header of these files. If one were to just place all of that information into the trixel along with the filename, you're basically done. The best code to read FITS files is here: https://github.com/esheldon/fitsio.
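For the header part, something like this could pull those values into the trixel database. This is just a sketch using fitsio; image_metadata is a hypothetical helper, and the WCS keywords assumed here (CRVAL1/2, CD2_2/CDELT2) are standard but can vary between surveys:

import fitsio

def image_metadata(path):
    # Read only the primary HDU header; no pixel data is loaded.
    hdr = fitsio.read_header(path, ext=0)
    # Degrees per pixel: prefer the CD matrix, fall back to CDELT.
    scale = abs(hdr['CD2_2']) if 'CD2_2' in hdr else abs(hdr['CDELT2'])
    return {
        'filename': path,
        'center_ra': hdr['CRVAL1'],       # reference (usually central) RA, deg
        'center_dec': hdr['CRVAL2'],      # reference (usually central) Dec, deg
        'width_pix': hdr['NAXIS1'],
        'height_pix': hdr['NAXIS2'],
        'width_deg': hdr['NAXIS1'] * scale,
        'height_deg': hdr['NAXIS2'] * scale,
    }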

Is it part of our job to do the indexing/conversion of the catalog data (currently in ra/dec) into Pixels?

Well, it's a job to be done, and I wouldn't complain if you think you have some good ideas how to do it! That said, the Mozilla ScienceLab will shortly be putting some developers onto Trillian, so it would be worth sitting down and organizing the effort as appropriate.

Do we need both the images and the catalog or should we only look at one or the other?

Ultimately both will be needed. If you find it easier to work with just the catalog information first, that would be fine (since you can at least plan knowing the images will need to be added). But given the above, you can also put the contents of the FITS header metadata into the trixel database. With the exception of the data translator described above, that's 90% of what you'd need to do with images for now I think.

It was mentioned that we might start with a small section of the sky first, is this something that would make sense? If so, how should we proceed?

Yes, absolutely. I don't have the disk space to do the whole sky (>50TB), but it's not useful to do that now. Initially I want to build a proof of concept that covers the central "command and control" database, the ability to build trixels from existing data, and the ability to communicate with trixels located on a remote server (i.e. the protocol/API/etc.). Trixels would be semi-autonomous in the sense that each storage node runs its own PostgreSQL instance.

The region of sky I've chosen is called "Stripe 82", which is defined by these coordinates:

306° < ra < 62° and -1.27° < dec < +1.27°

Note that the ra coordinate crosses zero, so you can also express it as (particularly if you are writing an SQL query):

(306 < ra AND ra < 360) OR (0 <= ra AND ra < 62)
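In code, a range check that handles the wrap through zero might look like this (an illustrative helper, not part of any existing importer):

def in_stripe82(ra, dec):
    # Stripe 82: RA runs from 306 deg through 0 to 62 deg, Dec within +/- 1.27 deg.
    in_ra = (306.0 < ra < 360.0) or (0.0 <= ra < 62.0)
    in_dec = -1.27 < dec < 1.27
    return in_ra and in_dec

# in_stripe82(352.2, 0.76) -> True; in_stripe82(120.0, 0.0) -> False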

Let me know what else I can help with, particularly with code that deals with astronomical data. I want to make sure no time is spent reinventing the wheel! Thanks!

demitri commented 9 years ago

My colleague Dustin Lang (@dstndstn) has done a lot of work on WISE imaging, particularly in unblurring the images: http://unwise.me. The satellite took several images of most parts of the sky, and while ultimately I’d like to include each pass individually, for now it’s more sensible to use the images added together (called “co-adds”). Dustin has made available his processed images, so I’d like to use those.

He sent me this script to select Stripe 82 images:

wget http://unwise.me/data/tiles.fits
fitscopy tiles.fits"[dec>-1.6 && dec<1.6]" stripe82-tiles.fits
tablist stripe82-tiles.fits"[col coadd_id]" | tail +3 | awk '{print $2}' | awk '{printf "wget -c http://unwise.me/data/%s/%s/unwise-%s-w1-img-m.fits\n", substr($1,0,3),$1,$1}' > download.sh
chmod 755 download.sh
./download.sh

The fitscopy program is included in the CFITSIO software, the C library for FITS files that nearly all FITS code is built on top of (https://heasarc.gsfc.nasa.gov/docs/software/fitsio/). @maxogden, I'll just email you the download script so you don't have to worry about building the utilities above (though they might be useful for extracting header information; alternatively, use astropy.io.fits). The total data will be ~16GB.

dstndstn commented 9 years ago

By the way, that script just grabs the "W1" (3.4 micron) images; W2, W3, and W4 also exist.

demitri commented 9 years ago

Ah yes– good point. I think for our purposes now one wavelength is probably sufficient for development, but keeping in mind we will add more later. Unless you want to have two sets to start with @maxogden – it’s really up to you.

ywyw commented 9 years ago

@demitri @dstndstn Sorry to go silent for a while, I took the past week to dive pretty deep into the data and the different Python libraries I found for parsing the FITS format. One key issue that came up: HEALPix grids come in many different resolutions, and some images are full-sky surveys, so it's not practical to index every pixel at the finest resolution for such a survey - in one case that works out to about 300,000 HEALPix pixels for just one FITS file. Do you have a particular HEALPix resolution in mind, or would you like to base it on the resolution of each FITS file?

ywyw commented 9 years ago

@demitri @dstndstn Actually, make that 3 million HEALPix pixels for one example I stumbled across.

demitri commented 9 years ago

@ywyw, @maxogden - Sorry for the late reply! A few things.

There are a few Python libraries for working with FITS files. Several years ago astronomical Python code was pretty scattered, but we’ve gathered most code into the Astropy project (also on GitHub). That said, there is an independent Python FITS library that I feel is better and likely to remain so for a while: fitsio. The interface is quite similar between astropy.io.fits and fitsio, but the latter is built on top of a C library, so it's fast and more flexible.
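For orientation, reading the same image with each library looks roughly like this (the file name is just a placeholder built from the download script above):

import fitsio
data, header = fitsio.read('unwise-3524p015-w1-img-m.fits', header=True)

from astropy.io import fits
with fits.open('unwise-3524p015-w1-img-m.fits') as hdul:
    data = hdul[0].data
    header = hdul[0].header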

Healpix is a more complex problem. The first image on the HEALPix home page is a good place to start. The lowest possible resolution has 12 pixels. The next step up in resolution divides each of these pixels into 4, meaning that resolution has 48 pixels. This can then continue to be as fine as needed.
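Since healpy is already in use elsewhere in this thread, the pixel counts are easy to check there (npix = 12 * nside**2, and nside doubles at each step up in resolution):

import healpy as hp

for order in range(4):
    nside = 2 ** order            # 1, 2, 4, 8, ...
    print(nside, hp.nside2npix(nside))
# -> 1 12, 2 48, 4 192, 8 768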

The concept of Healpix pixels and Trillian pixels will need to work together as Trillian adds a new component – the size on disk of the data that fits into a particular pixel. Let's assume we pick some arbitrary Healpix pixel resolution, and that two of these pixels fill a server that has 10TB available on it (the numbers here are just illustrative). Later, new data is available that would belong in those pixels, but that server has no more room. Then this would happen:

• divide the two pixels into four
• move one pixel to another server
• place the new data into the now smaller pixels (and in the one on the new server if applicable)

Because the data available is absolutely not homogeneous across the sky, I see no reason why the Trillian pixels can't be adaptive in response to the available data and disk space.

Just as with geohashing, the pixels can be indexed hierarchically. The lowest resolution would have pixels numbered from 1-12. If we then subdivide pixel 2 into four pixels (subdivisions are always into four), we name each sub-pixel 1-4, and their "address" would be 2.1, 2.2, 2.3, and 2.4. If there is a lot of data in one region, we might see many subdivisions, e.g. 2.3.3.2.1.4. This is basically like a sparse matrix - we only subdivide the sky where necessary, and we don't need to index 3M pixels.
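One possible way to derive such an address from a nested-ordering HEALPix index (healpy's nest=True scheme) - a sketch only, with a hypothetical function name and the 1-based numbering used above:

def trixel_address(nested_index, order):
    # Each subdivision adds 2 bits to a nested HEALPix index, so peel them off
    # two at a time, most significant first.
    base = nested_index >> (2 * order)                # which of the 12 base pixels
    digits = [str(base + 1)]
    for level in range(order - 1, -1, -1):
        child = (nested_index >> (2 * level)) & 0b11  # child number 0-3 at this level
        digits.append(str(child + 1))
    return '.'.join(digits)

# trixel_address(25, 2) -> '2.3.2'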

The last point is that the pixel indexing scheme for Trillian should not depend on any particular data set, so it should have no relation to any FITS file. Trillian should be concerned with how much disk space a pixel occupies, and divide (or even merge) pixels as needed. This will all be tracked in the central server.

I will be better about replying quickly – I'm transitioning more of my time to Trillian now.

ywyw commented 9 years ago

@demitri Yes, currently, I have been using fitsio and healpy for most of the data munging. Are there any other Python libraries you would recommend?

Currently I am making a simplification and using just one Postgres database to index a small set of FITS files; however, what you propose (having a server for every two pixels) should be viable in the future. I am not doing any file manipulation with FITS, just indexing the filenames for now. It sounds like you envision a system that cuts out the particular region of a FITS file corresponding to a pixel.

The geohashing idea you have is also doable, but I am not sure how that would work on the querying side. Say you were interested in a particular object in the sky, which has some coordinates. Would you then translate that into the pixel you want on the user side and query the database for the finest resolution first? I think you mentioned in an earlier exchange that one challenge with this rectangular-to-HEALPix conversion is that the edges might require some computational stitching. How often are the images full-sky vs. a small region?

I will be traveling for the next few days, but when I return, we should set up a meeting. A lot has been discussed so it makes sense to make sure we prioritize the right things. In the meantime, I will chew on these ideas...

demitri commented 9 years ago

@ywyw No, those sound good (along with Astropy).

I think there will be confusion in what we mean by “pixel” - currently there are three potential definitions:

• a pixel in an image (e.g. a FITS file)
• a HEALPix pixel
• a Trillian pixel, which is mapped to a HEALPix pixel at a particular resolution and contains disparate data

Let’s call the third one a “trixel” to help this a bit.

As for the database, I envision a single instance of PostgreSQL per storage node, not per pixel (of any definition).

The filename will need to be stored and indexed, but it’s worth noting that the higher-level system will almost never query based on the filename. I will almost never know what that is beforehand, and almost never care. The vast majority of the time I will be querying with a position on the sky (e.g. ra/dec), which will then be translated to a FITS file (or very easily multiple FITS files).

The geohashing idea you have is also doable, but I am not sure how that would work on the querying side. Say you were interested in a particular object in the sky, which has some coordinates. You would then translate that into the pixel you want on the user side and query the database for the finest resolution first?

The user/client side will have no concept of the trixel scheme, and shouldn’t. The query will initially go to the central server, whose database will know where all of the available data is, i.e. which trixel(s), and thus nodes. Even at this step, the resolution isn’t important; the central server would just say "the data is on node xx.xx.xx.xx and is in trixel 2.3.2.1" (for example). For a point on the sky and a radius, it may return two pixels, which don’t even have to be at the same resolution.
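To make that concrete, the central-server lookup could be as small as this sketch (the fixed order, the registry dict, and the node address are all placeholder assumptions; the real version would live in the central database and handle trixels at mixed resolutions):

import healpy as hp

ORDER = 2                                    # resolution of this toy registry
REGISTRY = {25: 'node-xx.xx.xx.xx'}          # nested pixel index -> storage node

def locate(ra_deg, dec_deg):
    nside = 2 ** ORDER
    pix = hp.ang2pix(nside, ra_deg, dec_deg, nest=True, lonlat=True)
    return pix, REGISTRY.get(pix, 'no data yet')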

I think you mentioned in an earlier exchange that one challenge with this rectangular to Healpix conversion is that the edges might have to do some computational stitching. How often are the images fullsky vs a small region?

Almost no data will be full-sky - very few surveys are like that. Almost no ground-based telescope performs all-sky imaging (for obvious reasons, with the exception of 2MASS, which used two telescopes, one in each hemisphere), though there are several space-based surveys that are all-sky (WISE, AKARI, GALEX, Planck, etc.). The latter produce huge data sets, which will certainly be split into many, many files (as you've seen with WISE). This is because you of course can't have a FITS file that's one or several TB in size, but the division into many files is pretty arbitrary. That's why we never care about the file name.

I think I’d start by dividing the sky into the 12 lowest resolution HEALPix pixels, which will all fit on a single node (as they start with no data). As data is added, we’ll have code that starts to subdivide as needed, but it’s not needed until the storage node runs out of available space and needs to subdivide and move a trixel onto a another node.

max-mapper commented 9 years ago

@demitri I'm just starting to play around with indexing the WISE catalog data for Stripe 82 (more on that soon), but first I wanted to build something for visually selecting an ra/dec region.

Can you give me a sanity check and tell me if this link correctly opens up a blue rectangle on the boundary of stripe 82? http://maxogden.github.io/ra-dec-bbox-finder/#-1.270000,-118.000000,1.270000,126.000000

demitri commented 9 years ago

I think it's close, but there is a discrepancy between that and what you see on Google Sky. First, the coordinates in your map appear reversed - this might be due to the GIS projection you are using (or something related to that). Can you specify the SRID string in that visualization? If so, you should use this:

"+proj=longlat +a=57.2957795 +b=57.2957795 +no_defs"

This sets the projection to a perfect sphere whose semi-major and semi-minor axes are 1 radian expressed in degrees (57.2957795), so the units on the map are displayed in degrees.

But this might be going a little further afield. I think having a good visualization for this is very important, but just doing a simple range check in RA/dec using the coordinates above should be sufficient.

max-mapper commented 9 years ago

@demitri ok cool, I updated it a little and fixed the coordinate issues (I think). I'm not formatting them 100% correctly, but the numbers should be mostly right.

new URL: http://maxogden.github.io/ra-dec-bbox-finder/#-1.27,62,1.27,306

I just wanted something simple, so this is probably good enough for now

update I messed up the ra/dec order in the link above, here's some fixed links:

http://maxogden.github.io/ra-dec-bbox-finder/#163.470000,-49.620000,163.850000,-49.360000 http://maxogden.github.io/ra-dec-bbox-finder/#62,-1.27,306,1.27

max-mapper commented 9 years ago

@demitri OK I have a prototype that I'd like some feedback on

I set up http://trillian.dathub.org/, which is just a CNAME that points to the digital ocean droplet I made for us last week. I'm currently running a dat on port 80 there. The dat has the results from this script in it: https://github.com/maxogden/dat-wise-stripe-82

What that script does is run this pipeline:

cat wise-allwise-cat-part25.gz | \
gunzip | \
csv-parser --headers=$(cat columns.tsv) --separator="|" | \
jsonmap "{ra: +this.ra, dec: +this.dec, key: this.source_id}" | \
jsonfilter --match="((306 < this.ra && this.ra < 360) || (0 <= this.ra && this.ra < 62)) && (-1.27 < this.dec && this.dec < 1.27)" | \
node index.js

Here's a breakdown:

• cat + gunzip stream and decompress one gzipped chunk of the catalog
• csv-parser parses the pipe-separated rows, using the column names in columns.tsv as headers
• jsonmap keeps only ra, dec, and source_id (as key) for each row
• jsonfilter drops every row outside the Stripe 82 bounds
• node index.js computes a geohash for each row, prefixes it to the key, and writes the rows into the dat

What gets produced is a stream of objects like this:

{
  "ra": 352.21649919999993,
  "dec": 0.7579901000000001,
  "key": "upbxcpzzyrÿ3524p015_ac51-008590"
}

key is in the format <geohash>ÿ<source_id> (we use ÿ as the separator because it's the highest-sorting single-byte character, \xff)
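A rough Python equivalent of that key construction, for illustration (the JS implementation may differ; this uses the pygeohash library and assumes dec maps to geohash latitude while ra is wrapped into -180..180 for longitude, which is only one possible convention):

import pygeohash as pgh

SEP = '\xff'   # 'ÿ', the highest-sorting single-byte character

def make_key(ra, dec, source_id, precision=10):
    lon = ra - 360.0 if ra > 180.0 else ra   # wrap RA into a longitude-like range
    return pgh.encode(dec, lon, precision=precision) + SEP + source_id

# make_key(352.2164992, 0.7579901, '3524p015_ac51-008590')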

So now that the keys are geohashed I have implemented a crude query system on top, using only dat as the backend:

$ node query.js 305.999064,-1.270000,306.000000,-1.269482 9
Querying 13 geohash regions
{ gte: 'gzzpgpgpy', lt: 'gzzpgpgpz', limit: -1 } 4
{ gte: 'gzzpgpgpz', lt: 'gzzpgpgq', limit: -1 } 28
{ gte: 'gzzpgpgq', lt: 'gzzpgpgrc', limit: -1 } 40
{ gte: 'gzzpgpgrc', lt: 'gzzpgpgrd', limit: -1 } 38
{ gte: 'gzzpgpgrd', lt: 'gzzpgpgrg', limit: -1 } 34
{ gte: 'gzzpgpgrg', lt: 'gzzpgpgrh', limit: -1 } 44
{ gte: 'gzzpgpgrh', lt: 'gzzpgpgrv', limit: -1 } 40
{ gte: 'gzzpgpgrv', lt: 'gzzpgpgrw', limit: -1 } 40
{ gte: 'gzzpgpgrw', lt: 'gzzpgpgrz', limit: -1 } 27
{ gte: 'gzzpgpgrz', lt: 'gzzpgpgs', limit: -1 } 27
{ gte: 'gzzpgpgs', lt: 'gzzpgpgxc', limit: -1 } 36
{ gte: 'gzzpgpgxc', lt: 'gzzpgpgxd', limit: -1 } 34
{ gte: 'gzzpgpgxd', lt: 'gzzpgpgxg', limit: -1 } 35
done 427

So this isn't quite ideal because we aren't using the healpix projection, but @ywyw is working on a modified geohash that projects ra/dec into healpix first: https://github.com/ywyw/dat-wise/blob/master/radec2healpix.py

The summary of @ywyw's work is to re-implement geohashes using HEALPix as the underlying projection, since geohashes normally use lat/lon, which is quite distorted.

This raises the question of how to query healpix geohashes on top of dat, since they add some complexity over vanilla geohashes, and we started a thread here to track that topic https://github.com/ywyw/dat-wise/issues/4

The main benefit I can see so far of this approach is that we can dump all of the catalog data into dat using an indexing scheme like this and allow streaming access to any ra/dec region. In the future, when we start working with FITS files, we can also store them in dat and use the same indexing scheme to query them.

In addition to geohashes I found a variety of other very similar reduced-dimensionality indexing schemes (S2 cells, quadkeys, nontrees); each one presents subtle tradeoffs. But geohashes are so far the most widely implemented, so I started there.

This experiment has raised lots of questions for dat as well, and has been really useful so far in helping us shape dat in the future. Here are a couple of takeaways I've had so far:

This is just the first phase; we still have to figure out how to stream data from dat into Postgres trixel DBs. But @demitri, do you think this is a good start toward enabling that to happen?

demitri commented 9 years ago

Let me take half a step back and ask what we are getting from the geohashing. Is this to generate a unique ID to query on? For the most part, the only use I have for HEALPix is as a way to divide the data into pieces that can be moved to different nodes. An incoming query on ra/dec would need to be translated to the HEALPix-based trixel node, and the query then performed there. (Of course, multiple nodes might be returned.)

I think there are existing indexing schemes (q3c is really ideal for this); I'd avoid inventing something new when possible. My understanding is that DAT can sit on top of something that exists (PostgreSQL + q3c, for example) and provide the data transport that these do not. But I might be missing something.

max-mapper commented 9 years ago

@demitri good question -- the geohashes are only used to create a key that can support ra/dec queries directly from dat.

I'm not sure yet if we really need to put the catalog data into dat or not, but so far it's a lot nicer working with data in dat than it is with the 9gb gzipped CSV files.

This will make it easier to do things on the fly like 'get all of stripe 82 and pipe it into trixel dbs'. With the CSVs you have to download + process a minimum of 15 million objects at a time due to the way the WISE catalog is distributed.

Basically what I have now is something that we can dump the catalog data into and have a nice streaming interface to a ra/dec bounding box region. In the future we can instead just parse the huge CSVs and store each item directly into a psql trixel DB. This will be a good opportunity for us to write a high performance psql bulk insert plugin for dat (as we mentioned on the last video call).
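As a rough sketch of that direct path (database and table names here are placeholders, and a real version would pick out the columns it wants while streaming rather than copying all ~300), a bulk load into a trixel's Postgres could look like this:

import gzip
import psycopg2

conn = psycopg2.connect(dbname='trixel_2_3_2', user='trillian')   # hypothetical trixel DB
with conn, conn.cursor() as cur, gzip.open('wise-allwise-cat-part25.gz', 'rt') as f:
    # Stream the pipe-separated catalog chunk straight into Postgres via COPY.
    cur.copy_expert(
        "COPY wise_catalog FROM STDIN WITH (FORMAT text, DELIMITER '|', NULL '')",
        f,
    )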

So right now I see dat having two roles:

If accessing the raw catalog data in a more granular way than the 9GB CSVs isn't that important, then we don't have to take this further, but so far it's been a fun project to dive into astronomical data with. For example, I could run a job on our droplet that imports the WISE catalog data with 10 of the 300 columns, puts it in a dat, and then makes it easy to select regions of that catalog for export. I have been thinking of it as a sort of first pass at indexing the catalog, just to improve slightly on the CSV distribution.

demitri commented 9 years ago

@maxogden OK, I see. You are right, the CSVs are certainly inconvenient. However, I consider them to be a transport/archive format - they're just not worth working with directly. The first thing I do with data like this is dump it all into a database - the whole thing. This is an easy decision to make since the whole data set is 9GB; that's (on this scale of things) a trivial amount of data, even if compressed. Given that, I'd only ever make one full pass over the data. Once that's done, I create a q3c index on the ra/dec columns.
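For reference, that workflow sketched with psycopg2 (the table and column names are placeholders; q3c_ang2ipix and q3c_radial_query are the standard q3c functions for indexing and cone searches on ra/dec columns):

import psycopg2

conn = psycopg2.connect(dbname='trillian')       # hypothetical database
with conn, conn.cursor() as cur:
    # Index once after the bulk load...
    cur.execute("CREATE INDEX ON wise_catalog (q3c_ang2ipix(ra, dec))")
    # ...then cone searches are cheap, e.g. a 0.1 deg radius around ra=330, dec=0:
    cur.execute(
        "SELECT * FROM wise_catalog "
        "WHERE q3c_radial_query(ra, dec, 330.0, 0.0, 0.1)"
    )
    rows = cur.fetchall()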

But you are right: in this case Trillian is a little different. Here, I need to extract a subset of this for packaging into a trixel. I may do this several times as I create a few trixels, and seeking over 9GB of data is time consuming. I see two options for non-image data:

The idea is to leverage the high-performance indexing/searching capabilities of PostgreSQL and q3c. The data set would have to be ridiculously large for the second option not to work - we don't have catalogs that big (yet). There is absolutely nothing special about the CSV files themselves; to me, the natural state of this kind of data is in a database, and the CSV files are just a portable format. I feel that DAT's strength would be as the layer that sits on top of the database providing the data. There's no need to reimplement indexing/search when that problem is well solved (and is continually improved upon with new PostgreSQL releases).