zackbatist / open-archaeo

A list of open source archaeological software and resources
https://open-archaeo.info
Creative Commons Zero v1.0 Universal

Not just "What do we have?" but also "What is missing?" #1

Open nevrome opened 5 years ago

nevrome commented 5 years ago

In a brief exchange with @carlosrellan on twitter a few months ago I had the idea to create a ToDo list for archaeological open source software. This might be a great resource for thesis projects, hackathons, unconferences, summer schools etc.

On Friday this idea emerged again in a conversation with @nmueller18 and I remembered this new "What do we have" list by you, @zackbatist. Your list could be a starting point for a collection of ToDos. The community of practice for digital archaeology is pretty small, so it could be possible to establish it as a common hub for ideas and feature requests. Or does everybody prefer to protect their good project ideas?

@carlosrellan was asking for a harris matrix R package, @nmueller18 had several great ideas to reimplement algorithms that already existed in the past but are lost to framework progress and I think there is -- just as an example -- a desperate need for a simple graph/network transformation package in R.

Maybe @joeroe, @MartinHinz, @florianthiery, @dirkseidensticker or @benmarwick have an opinion or some more ideas?

zackbatist commented 5 years ago

Well I started keeping this list because I wanted to keep track of what kinds of digital tools had been made by archaeologists, and I've seen similar lists made for other disciplines or programming industries on GitHub. As I started browsing people's repositories to get the initial draft going (and as a form of procrastination, of course :P), here are a few things I've noticed: many projects

  1. seem to be abandoned,
  2. have only a single contributor,
  3. are made for specific purposes, situations or database schemas,
  4. are not strictly archaeological (e.g. scripts used to extract data from a database), and
  5. don't fit easily into neat category headers, or at least not the ones that I've come up with so far.

I hesitate to present this as a complete or authoritative list, since it obviously depends upon the interests and judgement calls of the people drawing it up. I guess it can be considered more as an informed survey of what's out there, of what kinds of tools have been made, and of how archaeologists develop digital tools to suit their needs and the needs of others around them. So I think that while it might be useful to use this as a tool for learning and mobilizing further work, more work needs to be done to determine its scope and to set transparent guidelines for how we document the field.*

That being said, perhaps a wishlist could be added as a separate file. I just fear that adding it to the main list risks making it passively prescriptive concerning what we consider to be ideal targets. I also have my doubts concerning whether people will actively share their ideas beyond a certain point. Moreover, I think some of the greatest things seem to emerge serendipitously or on an as-needed basis. However, I definitely see the value in using this as a point of departure for further work as well. So I created an associated gitter forum that might enable people to chat or create their own rooms to discuss potential avenues for further development - gitter.im/open-archaeo. I'm open to other ideas though.

*One example is my indecisiveness about whether to include each repository's maintainer, since as open projects they do not 'belong' to any single individual. However, I also understand that obtaining personal credit is an important aspect of working within academic or other professional environments. I'm still unsure about how I would reconcile this issue.

Another example is that many tools have very generic names or are not designed to be used as-is beyond the confines of a specific project, so what really differentiates them from other similar work is the context of their creation and intended implementation. We therefore run into the problem of differentiating between two kinds of code. The first is code designed to resolve a very specific issue, which may or may not be commonly faced in other projects, and which would determine its generalizability (e.g. a script that transforms the points on my specific grid pattern to a WGS84-compliant format, vs. a script that helps with cleaning up an extremely messy grid that has undergone very idiosyncratic changes over the years). The second is coding projects that are clearly designed to resolve or integrate with 'fundamental' archaeological issues or practices (e.g. representing the relations among single contexts), which are commonly dealt with across a range of projects.

joeroe commented 5 years ago

Interesting discussion. I like the idea of maintaining a "to do" list. A big part of that might be making viable open source alternatives for analyses that are currently usually done in proprietary or closed source software, e.g. Bayesian radiocarbon calibration, geophysics processing. Probably some people will feel protective of their ideas but that's an attitude we should try to combat: if another person starts a project you've been "meaning to do", see it as an invitation to contribute, not a missed opportunity.

The question is just one of coordination, which also touches on the fragmentation & lack of maintenance issues @zackbatist has brought up. I'm aware of several attempts to set up networks of computational/digital/open archaeologists, e.g. the CAA CSS and SSLA SIGs, the open archaeology working group, ISAAKiel of course. Probably we should settle on one and run with it!

Regarding attribution, I would say it is important to cite the major contributor(s) to a repository. Open source projects don't "belong" to anyone in an intellectual property sense, but that doesn't mean the norms of academic discourse don't apply to them. They can represent a considerable amount of work and it is important to recognise that if we want writing software to be taken seriously as a scientific activity (by universities, funding bodies, etc.).

zackbatist commented 5 years ago

Maybe I'm out of sync with what you mean by using a to-do list to coordinate future work. How about someone submit a pull request and we can go from there?

nevrome commented 5 years ago

Thanks for this great input, @zackbatist and @joeroe. You mention many important points! My idea here had a pretty limited scope to begin with: Just a simple ToDo list. But I think you're right when you mention the broader questions surrounding this topic.

In my perception, most code written by archaeologists is limited to the scope of an individual project/paper. It has to be shared to assure reproducibility, but it's not exactly intended to be used for many other applications. The input and output are highly specific, the documentation is minimal and there are no generic APIs. I think we can leave this code aside for the moment.

More interesting for the general community of practice (Research Software Engineers, hello @izaromanowska !) are tools which are intended to be used for a variety of purposes. Real GUI software, libraries, packages etc. with a well-documented user interface. I think we need more of this kind of software and we need dedicated maintainers. Also, we need a network of contributors that rescue projects that are dying due to technical, financial or, horror of horrors, biological developments. I think we already lost some advanced tools/algorithms/implementations that were produced in the last 40 years. In my opinion, the SIG SSLA might be a good institution to establish such a network.

But: A list of possible projects could be independent of this (or any other) organization. I had some discussions with Néhémie Strupler about the value of decentralization and I agree with him: Maybe it's better to keep some things independent of each other. There's a balance to be found.

benmarwick commented 5 years ago

This is an interesting discussion. What software do archaeologists really need that would also get wide use? I struggle to think of any, but perhaps it's because I have a high tolerance for writing a lot of awkward R code to get my stuff done.

My sense is that we already have some excellent environments for developing custom software, such as R and Python. And I am most interested in encouraging our colleagues to use those for pretty much everything. Most of the analysis that most archaeologists are doing is pretty generic and doesn't have much to gain from specialized tooling. I think we have much more to gain from a transition from point-and-click to scripted analyses. And from encouraging people to share their code and treat it like a first class research product.

So I don't really agree with @nevrome on the need for archaeologists to have GUI software, because that is a move away from scripting and reproducibility.

I think that archaeology lacks a Lakatosian hard core of theory and method that would be well-served by specialized software packages. For example, most R packages developed by archaeologists are barely used by anyone except the original authors (please prove me wrong!). Effort on those projects would not have much payoff or improve archaeology in general. I think this echoes what @zackbatist wrote above. One possible exception might be to port some of the statistics in http://tfqa.com/ to an R package. Perhaps @mpeeples2008 is doing something along those lines: http://www.mattpeeples.net/resources.html and https://mattpeeples.net/?page_id=656 ?

I don't know how often you make Harris matrices in private, but I rarely see them in publication, so I don't sense a high demand for that, @carlosrellan. We have many excellent R packages for Bayesian calibration of C14 ages, e.g. Bchron and rcarbon, @joeroe. Geophysics is less well served, but that's a niche community, and @isaacullah has a great list of relevant tools here: http://isaacullah.github.io/List-of-FOSS-tools-for-academics/

I wonder if there is an empirical way to make a todo list? I mean what kinds of stats and methods are most frequently mentioned in archaeological literature, that we lack an easy way to compute? I made a sketch of this more generally for science here: https://gist.github.com/benmarwick/c8977f979849eabe318771735e39d13a and it would need a bit of polishing to get useful results for a todo list. We could also look at a large sample of archaeology methods papers and see what are most frequently cited, and package up that method.
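To make the idea of counting method mentions concrete, here is a minimal base-R sketch. It is not the code from the gist linked above. The `abstracts` vector and the `methods` list are invented for illustration, and a real attempt would need a proper corpus and a much longer, curated vocabulary.

```r
# Hypothetical inputs: three invented abstracts and a hand-picked method list
abstracts <- c(
  "We apply correspondence analysis and Bayesian calibration to the assemblage.",
  "A Harris matrix was built; correspondence analysis followed.",
  "Kernel density estimation of site locations, with Bayesian calibration of dates."
)
methods <- c("correspondence analysis", "Bayesian calibration",
             "Harris matrix", "kernel density estimation")

# For each method, count the abstracts that mention it (case-insensitive)
mentions <- vapply(
  methods,
  function(m) sum(grepl(m, abstracts, ignore.case = TRUE)),
  integer(1)
)

# Most frequently mentioned methods first
sort(mentions, decreasing = TRUE)
```

Cross-referencing the top of such a ranking against the packages already on the list would then point at candidate todo items.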

Those are some approaches that might help to ensure that software development effort by archaeologists is meaningful. However, I think the highest return on effort will be on intangibles, such as raising the visibility and awareness of these tools, and promoting their use more widely in archaeology. Community building and establishing new norms of scholarly communication (i.e. publishing) that include software (e.g. code).

zackbatist commented 5 years ago

What's been done so far has been very basic, just something to occupy my spare time and out of my interest in identifying and assembling the scattered products of various independent efforts. However, I can see this as becoming more methodological (in the sense advocated by @benmarwick) so as to produce a dataset to discern tendencies or norms. For now, though, I'm enjoying the reflection that comes with the challenge of classifying all of this stuff. I think that the schemes we use to classify the work that has already been done might be informative when planning for future work, and when fostering more effective and normal use of digital tools among archaeologists in general. But I think that more time/discussion is needed to flesh out our understanding of the ways in which archaeologists produce and use these tools before beginning to strategize software development projects that may otherwise not align with what archaeologists want/need/expect (though I totally agree with @benmarwick about getting people as familiar as possible with the ways in which their data is being transformed and represented in digital environments, and would advocate for less GUI-intensive products given my somewhat optimistic outlook about the knowledge and skills of those who wouldn't label themselves as Digital Archaeologists with a capital D).

I will try to be more detailed in my commit messages as records of these reflections (why I changed my mind about a category, why I switch things up or swap items around), and I would suggest that others do too as they contribute, if they choose to.

joeroe commented 5 years ago

@benmarwick I meant Bayesian stratigraphic modelling which, please correct me if I'm wrong, neither Bchron nor rcarbon does. oxcAAR does provide an interface to OxCal to do it, though. Harris matrices are ubiquitous in contract archaeology and post-excavation work, at least in Europe, and I certainly would prefer a nice R package to do it over fiddling with them in Excel!

Also I don't think we need to see easy-to-use, GUI interfaces and scripting/reproducible research as mutually exclusive (e.g. we have shiny).

nmueller18 commented 5 years ago

@joeroe I completely agree with you in disagreeing with @benmarwick.

Harris matrices are not only for displaying stratigraphic relationships, they are also analytical tools. For example, I use Stratify by Irmela Herzog for finding loops in my layer-assignments. As nice as it is, Stratify involves additional steps of exporting and importing, it is only Free (not Open Source), it is Windows-only, and the day when it will no longer run is already in sight. In my view, this is a prime example of a vital tool which can only be maintained by a group of people.

I think there is a "market" for specialized archaeological software, but there is little point in producing things in advance or anticipation. The challenge is to identify already on-going or desperately needed projects which are potentially of wider interest and then to evangelize them. By that, you are also setting the agenda in methodological terms. A negative example for this is the 'Greatest empty circle'-method of the Zimmermann-group for finding settlement territories, which has not found a wider audience simply because there is as yet no easy way to calculate it.

benmarwick commented 5 years ago

@joeroe ah yes, thanks for clarifying, I think it might be the ArchaeoPhases package that does that, @tsdye will know for sure. And yes, I don't really know the needs of contract archaeology in Europe well at all. Perhaps a Harris matrix R package would be a very good use of our collective effort!

@nmueller18 I would love to hear more about how we can identify those projects which are potentially of greater interest so we can evangelize them, and what kinds of evangelism would be most effective.

mpeeples2008 commented 5 years ago

I'd be in favor of some sort of to-do list that people could contribute to if for no other reason than there are a lot of things I wish I had time to do and clean up that would probably be better served by being part of a larger project with more people involved. I think even just compiling common statistical methods and tools with archaeological data/examples in the documentation would be a big help for teaching as some students have trouble connecting to the standard examples given for many common methods.

As @benmarwick suggested, I've pulled together much of the most useful TFQA stuff into a set of R scripts that I use for teaching two graduate methods courses and for my own research. A lot of stuff that the old TFQA did is possible in R already without writing new/elaborate scripts but there are a few things that are more common in archaeology than in other realms that I've gotten a fair bit of use out of (especially various Monte Carlo procedures for assessing distance/similarity metrics and some methods for evaluating and comparing clustering solutions). I've thought about compiling those into a package at various points but I've yet to find the time to dedicate to it. I've documented several of these and they are on my GitHub and here: https://mattpeeples.net/?page_id=656 but I've got a ton more that I've just not had time to document thoroughly enough to release. If anyone was interested in combining these and a few other things into a TFQA package or something similar, I'd be happy to work on that (and having collaborators would be a kick in the pants to just get it done). I've also got a series of chronological tools that I've been wanting to document better and release as a package that could be thrown in. Surprisingly (to me at least) according to Keith Kintigh, TFQA still gets a lot of use and citations and he gets requests for it fairly frequently. Pushing people into a modern format may help spread the use of tools like R and Python in archaeology generally and this suggests there is an audience for something like this.

I agree with @benmarwick in general about the need to keep things reproducible but there are some areas where GUIs can be a useful tool for convenience or for users less comfortable in a programming environment. @ajupton and I have been working on a Shiny App to replace the old GAUSS routines that are used by many for defining and assessing groups in chemical compositional data. This is largely meant to replace the routines used by MURR (Missouri University Research Reactor) and to speed up my own analyses and theirs (we're working directly with MURR on this). In the process we're doing some things to allow for greater reproducibility including incorporating more un-/semi-supervised classification methods and having the Shiny script compile a text file that displays all of the code that was run from raw data to final product. That way, we can have it both ways to an extent (reproducible and GUI).

izaromanowska commented 5 years ago

Hi all, great discussion. From my perspective what the community really needs is the same functionality that they usually draw from MS Excel - drawing a simple histogram, a Student's t-test, getting the mean value of a column, a splash of PCA for the bone people, etc. That covers about 90% of archaeologists' needs and none of this requires any major investment in software development - more like a quick 'cheat sheet' plus a tutorial (@benmarwick the idea we had about a million years ago... maybe we should revisit). The problem now is that going through an online course or a textbook is highly inefficient for anyone who just needs that basic functionality - you just learn so much stuff that's not massively relevant before you get to the bits that you actually need. Which, let's face it, is mostly visualising artefact counts.
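As a rough illustration of how short such a cheat sheet could be, here is a minimal base-R sketch of the four tasks named above. The `bones` data frame, its column names and all its values are invented for the example; nothing here comes from a real dataset.

```r
# Hypothetical data: three measurements (mm) on 20 bones from each of two sites
set.seed(42)  # make the invented numbers reproducible
bones <- data.frame(
  site    = rep(c("A", "B"), each = 20),
  length  = c(rnorm(20, mean = 50, sd = 4), rnorm(20, mean = 55, sd = 4)),
  breadth = c(rnorm(20, mean = 20, sd = 2), rnorm(20, mean = 22, sd = 2)),
  depth   = c(rnorm(20, mean = 12, sd = 1), rnorm(20, mean = 13, sd = 1))
)

# Mean value of a column
mean(bones$length)

# Simple histogram
hist(bones$length, main = "Bone length", xlab = "Length (mm)")

# t-test comparing the two sites; R runs Welch's version by default,
# add var.equal = TRUE for the classic Student's t-test
t.test(length ~ site, data = bones)

# PCA on the three measurements, scaled to unit variance
pca <- prcomp(bones[, c("length", "breadth", "depth")], scale. = TRUE)
summary(pca)  # variance explained by each component
biplot(pca)   # scores and loadings in one plot
```

Swapping `bones` for a table read with `read.csv()` is the only change most users would need.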

Once you get into more complex analysis - well, here I'd say if you need to use Bayesian calibration on C14 you'd better know what you're doing, and if that requires moderately decent coding skills then go and get them.

So yes, a to-do list is a great idea, just don't forget to include the very basics.

carlosrellan commented 5 years ago

Sorry for the delay. In my tweet, I was expressing my concern about the fact that there is no R package for dealing with Harris matrices, while there is at least one in Python (ArkMatrix). I agree with @benmarwick that there may not be a high demand for this kind of tool (at least from a publication point of view), but the truth is that there are other researchers who share this concern and, as expressed by @nmueller18, Harris matrices are a major analytical tool. Having an R package would make it possible to couple the stratigraphic analysis with other types of analyses provided by other packages.

On the other hand, archaeologists are not cut off from the rest of the world, and many of their needs are shared by researchers from other disciplines. For example, I am interested in metrics, and I really miss packages for analysing citation data and coauthorship networks derived from Academia.edu or ResearchGate (as the scholar package does with Google Scholar). I am sure that a lot of researchers miss this kind of tool when preparing a grant proposal, etc. (especially in the Humanities, where WOS misses a lot of stuff).

I believe that this to-do list would be a good idea and it would be useful not only for Archaeologists, but for researchers in other disciplines (e.g. I am pretty sure that a lot of non-archaeologists visit @isaacullah 's list of FOSS tools).

joeroe commented 5 years ago

More generally, I think we need a tidy way to represent and analyse stratigraphic data in R. Harris matrices would be a major component of that, but it could also incorporate things like Dye and Buck's work on stratigraphic DAGs, which would link nicely to Bayesian modelling. A good subject for a hackathon perhaps!

tsdye commented 5 years ago

I like the idea of @nevrome to create a to do list in addition to the useful list compiled by @zackbatist . Personally, I wouldn't worry too much about organizing it at the start. An open source list can evolve as patches or pull requests are applied by the maintainer.

In response to the comments about the Harris matrix in R, I think this might be a useful project. The igraph package looks to be mature and full featured. My sense after browsing Kolaczyk and Csardi's book, Statistical Analysis of Network Data in R, is that it would provide a solid base for a DAG implementation similar to my Common Lisp hm package.

Caitlin Buck has a graduate student working on a project to develop a Harris matrix front end for Bayesian calibration software. She'll be describing that effort, which is just underway, at the CAA meeting in Krakow next spring.

In response to @benmarwick the ArchaeoPhases package maintained by Anne Philippe at Nantes includes functions that work on the raw MCMC output from BCal, OxCal, and Chronomodel. It does not help build the chronological model.

BTW, the open source Chronomodel project seems to be developing nicely. The developers provide binaries for Mac and Windows. I've been able to compile both the current stable release (1.5, I believe) and the development version (which ought to be released soon as version 2) on Linux. The UI is really nice, IMO.

zackbatist commented 5 years ago

Okay, I've bitten the bullet and pushed a small ToDo.md file to this repository, which includes items that have been mentioned here.

@mpeeples2008 I left the TFQA development item intentionally vague, since I'm unsure of what steps are needed to accomplish your goals. Please submit a pull request or reply to add more specific tasks, or we can link to a page where the development for that project will be undertaken.

Same goes for whoever else wants to expand or provide more detail for whatever they feel like taking a lead on or contributing to.

isaacullah commented 5 years ago

Hi all, sorry I'm a little late to the discussion, and thanks for the nice shout-outs about my blog post. Yes, a lot of non-archaeologists use that list (I frequently get emails), but, reading through the thread here, I'm struck by a couple of things:

1) Do we actually need or even want uniformity? Isn't that sort of against the idea of open source? We should have options (e.g., R v. SciPy, QGIS v. GRASS), so people can choose the interfaces they like the best, and tools should be made to be the best at the thing they do, and not try to do everything.

2) We should think more about user interfaces. I am not scared by code, but I have been doing it a long time. A lot of my students are frightened to death by it, however, and it takes a lot to get over that. I personally am a BIG fan of widget-based graphical programming interfaces, such as Orange (https://orange.biolab.si/), KNIME (https://www.knime.com/), Flowgorithm (http://www.flowgorithm.org/), AppInventor (http://appinventor.mit.edu/explore/), AgentSheets (http://www.agentsheets.com/), and GRASS's "graphical modeler" (https://grass.osgeo.org/grass77/manuals/wxGUI.gmodeler.html). These are MUCH easier for novices to get into, yet they are fully reproducible workflows that can generate real code.

3) Like it or not, archaeology-specific programs are really niche, and as such the audience is fairly limited. Certain things can be wider reaching, as some analyses are shared by other disciplines (my cumulative viewshed tool in GRASS, for example, is very widely used), whereas say a module to create Harris Matrices wouldn't really be used outside of Archaeology. One thing I think computational and digital archaeologists especially have become really good at is seeing the archaeological applications of software and hardware designed for other purposes. Just look at the boom in use of off-the-shelf drones and VRware for 3D mapping and scanning of heritage resources. Archaeologists aren't the main coders or builders, but we are really right up at the forefront of those people figuring out secondary (unintended) use-cases for all of that.

4) To that last point, I really think we need to be sharing generalizable workflows in addition to just code snippets or specific case studies. For example, here's one of my own: http://isaacullah.github.io/Digital-Data-Collection-for-Field-Sciences/

Anyway, just my 2 cents on all this.

Shout out to @mpeeples2008 @benmarwick @carlosrellan @izaromanowska ! Hope to see you all soon!

nevrome commented 5 years ago

Thanks for creating the ToDo-List @zackbatist. And thanks to all of you for these interesting comments. I will go back to the many ideas expressed here in preparation for the roundtable discussion about the SIG SSLA in Krakow. I was not at all aware of tfqa.com!

Some things I wanted to comment on: @benmarwick

> I think that archaeology lacks a Lakatosian hard core of theory and method that would be well-served by specialized software packages. For example, most R packages developed by archaeologists are barely used by anyone except the original authors (please prove me wrong!).

While the second statement is true, the first one is questionable. There are domains in archaeology where particular digital methods are very widespread. The first thing that comes to mind is fieldwork.

> I wonder if there is an empirical way to make a todo list?

I admire your approach, but in this case, the first option might be asking. In my computer science classes, they taught me to talk intensively with the "clients" before writing any code.

@tsdye

> Personally, I wouldn't worry too much about organizing it at the start. An open source list can evolve as patches or pull requests are applied by the maintainer.

That's a good point. One important side effect of this discussion is that now all of you are aware of the list.

@isaacullah

> Do we actually need or even want uniformity? Isn't that sort of against the idea of open source?

Maybe we should work more on software with simple command line interfaces. Then it's relatively simple to include them in whatever workflow or to write custom APIs in whatever language. Decentralization is nice, but what do we gain if we can't use the tools in the end? This ArkMatrix library mentioned by @carlosrellan is an excellent example of this problem.

joeroe commented 5 years ago

@nevrome @nmueller18 @carlosrellan @tsdye Well, curiosity got the better of me and it turns out it's pretty trivial to construct a basic Harris matrix (or DAG) in R using existing tools:

```r
library("tidyverse")
library("tidygraph")
library("ggraph")

# Turn a table of contexts into an edge list: one row per relation
# "from lies directly above to", read from the `above` list column.
harris <- function(strat) {
  to <- rep(strat$context, times = map_int(strat$above, length))
  from <- unlist(strat$above)
  tibble(to, from) %>%
    drop_na()  # a context with nothing above it contributes no edge
}

# Example data after Harris 1979, Fig. 12
harris12 <- tibble(context = c(1:9, "natural"),
                   above = list(NA, 1, 1, 1, c(2, 3, 4), 5, 6, 6, c(7,8), 9),
                   below = list(c(2, 3, 4), 5, c(5, 7), c(5, 8), 6, c(7, 8), 9, 9, "natural", NA),
                   equal = list(NA, NA, NA, NA, NA, NA, 8, 7, NA, NA))

# Contexts are the nodes; the relations derived from `above` are the edges
h12_graph <- tbl_graph(nodes = harris12, edges = harris(harris12))

# The Sugiyama layout arranges the directed graph in layers
ggraph(h12_graph, layout = "sugiyama") +
  geom_edge_link() +
  geom_node_label(aes(label = context), label.r = unit(0, "mm")) +
  theme_graph()
```

[Image: harris12, the resulting plot]

The work would be in things like importing different formats, interoperability with similar tools (hm, ArkMatrix, etc.), and plotting in the conventional style. I've started a repository at joeroe/stratigraphr – commits/suggestions/forks welcome!

nevrome commented 5 years ago

@joeroe This is the right spirit!

As far as I remember, the really challenging problem of Harris matrix creation is the logic check. I think Stratify has some intelligent algorithms here (@nmueller18?). Do you have this in mind?
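For orientation, the most basic form of such a logic check is a loop test: a valid set of above/below relations must form a directed acyclic graph. The base-R sketch below uses Kahn's topological sort for this. It is an illustration only, not Stratify's algorithm; the function name `find_strat_loops` and the `from`/`to` edge-list format are assumptions, and the example relations are invented.

```r
# Hypothetical helper: returns the contexts that lie on, or below, a loop;
# character(0) means the relations are loop-free.
find_strat_loops <- function(edges) {
  from <- as.character(edges$from)  # context above
  to <- as.character(edges$to)      # context below
  nodes <- unique(c(from, to))

  # For each context, how many contexts directly above it are still unprocessed
  indeg <- setNames(rep(0L, length(nodes)), nodes)
  counts <- table(to)
  indeg[names(counts)] <- as.integer(counts)

  # Kahn's algorithm: repeatedly remove contexts with nothing left above them
  queue <- names(indeg)[indeg == 0L]
  removed <- character(0)
  while (length(queue) > 0) {
    node <- queue[1]
    queue <- queue[-1]
    removed <- c(removed, node)
    for (child in to[from == node]) {
      indeg[child] <- indeg[child] - 1L
      if (indeg[child] == 0L) queue <- c(queue, child)
    }
  }

  # Anything that could never be removed is caught in, or sits below, a loop
  setdiff(nodes, removed)
}

# Invented relations: 1 above 2 and 3, both of which are above 4
valid <- data.frame(from = c("1", "1", "2", "3"),
                    to   = c("2", "3", "4", "4"))
# A recording error claims that 4 is also above 1
looped <- rbind(valid, data.frame(from = "4", to = "1"))

find_strat_loops(valid)   # character(0)
find_strat_loops(looped)  # all four contexts are implicated
```

This only reports that a loop exists and roughly where. Pinpointing the single wrong relation, or handling "equal" contexts, would need more than this.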

We should probably move this discussion over to your new Repo.

benmarwick commented 5 years ago

Yes, what a productive discussion! Great to see a tidy Harris pkg taking shape.

@nevrome I like your idea of asking, we (i.e. CAA CSS & SSLA SIGs, the open archaeology working group, ISAAKiel, and the SAA OSIG) could easily put out a simple survey that asks archaeologists more broadly (i.e. not the dozen or so of us who are in too deep already) where are the pain points in the software they're using, or what do they have a need for.

@gianmarcoalberti has also been quite active writing R code for archaeologists: https://github.com/gianmarcoalberti/GmAMisc/ (looks like there are some TFQA functions in there) and http://cainarchaeology.weebly.com/ and undoubtedly has thought a lot about what kinds of software archaeologists need and use.

I love the idea of an event like a hackathon that @nevrome & @joeroe noted to make progress on some of these ideas. I think we could very easily plan for something like this by co-opting a regular conference session, say at the CAA or SAA and instead of giving papers we could make an R package. For example, I think it could be quite a suitable project to tidy the scripts by @mpeeples2008 and @gianmarcoalberti into a tfqar pkg and prepare some vignettes to show beginners how to use them. I've participated in two of ropensci's unconfs that do this kind of thing. They've worked out a great format, really inclusive and friendly, and I think we could use it to make a mini-unconf-in-a-conf for the activities we've been discussing above.