zavtech / morpheus-core

The foundational library of the Morpheus data science framework
Apache License 2.0

Create DataFrames From Java Collections #88

Open dgunning opened 6 years ago

dgunning commented 6 years ago

Java developers need an easy way to create DataFrames from in-memory Java collections. This would make Morpheus much more suitable for general-purpose Java development.

I am proposing a class called ListSource or CollectionSource that would be able to create a DataFrame from a List of Lists. For example, let's say you read a table from a Word document

XWPFDocument document = new XWPFDocument(stream);
XWPFTable table = document.getTables().get(0);

and you convert the table to a List of Iterables (or Lists)

  List<Iterable<XWPFTableCell>> tableData =
          table.getRows().stream()
                  .map(XWPFTableRow::getTableCells)
                  .collect(Collectors.toList());

you could then create a DataFrame as follows

  DataFrame<Integer,String> data = new ListSource<XWPFTableCell>()
           .read(options ->{
                options.setData( tableData );
                options.setConverter( XWPFTableCell::getText );
            });

In general, a lot of data in Java can be converted to a List of Lists, so this feature would make Morpheus much more widely applicable.
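As a rough illustration of the conversion step such a ListSource might perform, here is a minimal plain-Java sketch. It does not use Morpheus. The class name ListSourceSketch and its convert method are hypothetical, and they only show how a converter could be applied to every cell of a List of Iterables before the values are handed to a DataFrame.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch of the cell conversion a ListSource could perform; not part of Morpheus. */
public class ListSourceSketch {

    /** Applies the converter to every cell and returns one List of Strings per input row. */
    public static <T> List<List<String>> convert(List<? extends Iterable<T>> data,
                                                 Function<? super T, String> converter) {
        List<List<String>> result = new ArrayList<>();
        for (Iterable<T> row : data) {
            List<String> converted = new ArrayList<>();
            for (T cell : row) {
                converted.add(converter.apply(cell));
            }
            result.add(converted);
        }
        return result;
    }

    public static void main(String[] args) {
        // Integers stand in for table cells such as XWPFTableCell
        List<Iterable<Integer>> tableData = Arrays.asList(
                Arrays.asList(1, 2),
                Arrays.asList(3, 4));
        List<List<String>> grid = convert(tableData, cell -> "v" + cell);
        System.out.println(grid); // prints [[v1, v2], [v3, v4]]
    }
}
```

With a Word table, the converter would be XWPFTableCell::getText, as in the proposed options.setConverter call.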

Note that the current Morpheus API already allows the following

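        // Column keys come from the header row (assumes rows = table.getRows())
        final Array<String> columns = Array.ofIterable(rows.get(0).getTableCells().stream()
                .map(XWPFTableCell::getText).collect(toList()));

        // Row keys start at 1; the +1 on the row ordinal skips the header row at index 0
        return DataFrame.ofObjects(
                Range.of(1, rows.size()).toArray(),
                columns,
                value -> rows.get(value.rowOrdinal() + 1).getTableCells().get(value.colOrdinal()).getText());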

but that was trickier to get right because of the long method chains and the +1 offset in the row lookup.

dgunning commented 6 years ago

This approach is generally applicable to many unconventional data sources, especially if we add a new TableAdapter utility class that takes a raw table and returns a List of rows

        // Get Canada's investor alerts page and find the table with the alerts
        Document doc = Jsoup.connect("https://www.securities-administrators.ca/InvestorAlerts.aspx/").get();
        Element table = doc.getElementById("ctl00_bodyContent_InvestorAlertSearchControl1_InvestorAlertListControl1_GridView_List");

        // Convert to a List of rows, each an Iterable of Elements
        List<Iterable<Element>> tableData =
                new TableAdapter<Element, Element, Element>()
                        .adapt(table,
                                tableElement -> tableElement.getElementsByTag("tr"),
                                rowElement -> {
                                    Elements tdOrTh = rowElement.getElementsByTag("td");
                                    return !tdOrTh.isEmpty() ? tdOrTh : rowElement.getElementsByTag("th");
                                });

Create the DataFrame

        DataFrame<Integer,String> data = new ListSource<Element>().read(options ->{
            options.setData( tableData );
            options.setConverter( e ->  e.wholeText().trim() );
        });
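For reference, a minimal sketch of what the proposed TableAdapter could look like. This is hypothetical: the meaning of the three type parameters (T = table, R = row, C = cell) and the adapt signature are assumptions inferred from the usage above, not an existing Morpheus class. A delimited string stands in for the raw table so the example runs without Jsoup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/**
 * Hypothetical sketch of the proposed TableAdapter.
 * T = raw table type, R = row type, C = cell type (assumed from the usage above).
 */
public class TableAdapter<T, R, C> {

    /** Extracts the rows from the table, then the cells from each row. */
    public List<Iterable<C>> adapt(T table,
                                   Function<T, Iterable<R>> rowExtractor,
                                   Function<R, Iterable<C>> cellExtractor) {
        List<Iterable<C>> result = new ArrayList<>();
        Iterable<R> rows = rowExtractor.apply(table);
        for (R row : rows) {
            result.add(cellExtractor.apply(row));
        }
        return result;
    }

    public static void main(String[] args) {
        // A plain string stands in for a raw table: rows split on ';', cells on ','
        String raw = "name,age;alice,30;bob,25";
        List<Iterable<String>> tableData = new TableAdapter<String, String, String>()
                .adapt(raw,
                        table -> Arrays.asList(table.split(";")),
                        row -> Arrays.asList(row.split(",")));
        System.out.println(tableData); // prints [[name, age], [alice, 30], [bob, 25]]
    }
}
```

Keeping the row and cell extractors as plain functions means the same adapter could serve HTML tables, Word tables, or any other source that can list its rows and cells.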