zipline-live / zipline

Zipline-Live, a Pythonic Algorithmic Trading Library
http://www.zipline-live.io/
Apache License 2.0

Enable Fundamentals data support #42

Open alphaville76 opened 6 years ago

alphaville76 commented 6 years ago

I know that you already planned to support fundamentals data.

I just wanted to point your attention to a small project I started some time ago to provide fundamentals data support for zipline: https://github.com/alphaville76/Fundamentals

I'd also like to help implement this feature in zipline-live: let me know how I could help.

Last but not least: an important question is how to feed the data. Of course it would be great to continue to support Morningstar as a source, but I don't know how expensive their data are and how they could be licensed to individual investors.

bartosh commented 6 years ago

I have code that ingests fundamental data into the bundle from CSV files in the format name,date,value.

I can share it if the approach makes sense.

vprelovac commented 6 years ago

I have contacted Morningstar for a quote and will report back on the findings.

tibkiss commented 6 years ago

@alphaville76 : Your work on fundamentals looks really nice, thanks for sharing!

In order not to deviate too much from Q's interface, I'd propose that we provide a pipeline-aware interface, similar to USEquityPricing or the morningstar package at Q.

The USEquityPricing & USEquityPricingLoader classes could be taken as an example for this work item.
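
For illustration, a minimal sketch of such a pipeline-aware dataset in the style of USEquityPricing could look roughly like this (the field names are only examples, not a fixed schema):

import numpy as np
from zipline.pipeline.data import Column, DataSet

class Fundamentals(DataSet):
    # each fundamental field becomes a pipeline column,
    # analogous to USEquityPricing.close etc.
    pe_ratio = Column(np.float64)
    eps_mrq = Column(np.float64)
    total_assets_mrq = Column(np.float64)

A loader analogous to USEquityPricingLoader would then be responsible for serving these columns to the pipeline engine.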

Do you guys agree to this approach?

alphaville76 commented 6 years ago

I agree, we should deviate from the original Q interface as little as possible.

Take a look at https://github.com/alphaville76/Fundamentals/blob/master/query_data.py It's exactly the same API as Q's get_fundamentals, even for the field names.

I could not use Morningstar as a data source, but I mapped the data from SHARADAR_SF1 (https://www.quandl.com/data/SF1-Core-US-Fundamentals-Data) to the Morningstar naming where possible. This happens in map_row_to_fundamentals: https://github.com/alphaville76/Fundamentals/blob/master/insert_data.py

alphaville76 commented 6 years ago

Some other information:

alphaville76 commented 6 years ago

@vprelovac Very good! Continuing to use Morningstar is the best thing for a lot of reasons!

bartosh commented 6 years ago

Here is the code that I mentioned yesterday: https://github.com/bartosh/zipline/commits/fundamentals

The approach is quite generic, as the data is loaded from simply structured CSV files and ingested into the bundle. The data can be used in Pipeline.

alphaville76 commented 6 years ago

Well, we already have a basis to implement both the get_fundamentals and Pipeline API!

@bartosh Where did you get the fundamentals data to ingest? It's important to consider that the data source must also contain delisted companies to avoid survivorship bias, and that the data should not be restated.

bartosh commented 6 years ago

@alphaville76 I used a set of fundamental datasets from Quandl.

However, I don't think we should be attached to any data provider. That's why I used a very simple data structure: CSV files with only 3 fields: name, date, value. Almost anything can be easily converted to this format. This approach would save us a lot of time if we choose to use it. Instead of developing support for multiple data provider formats we can support only one. Converting more complex formats to the simple one is a much easier task than developing support for them, I believe.
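
As a rough sketch of such a conversion, assuming a hypothetical wide-format provider export with one row per report date and one column per field:

import pandas as pd

# hypothetical provider export: date, eps, total_assets, ... (wide format)
wide = pd.read_csv('provider_export_JNJ.csv', parse_dates=['date'])

# reshape into the simple long layout: name,date,value
long = wide.melt(id_vars='date', var_name='name', value_name='value')
long[['name', 'date', 'value']].to_csv('JNJ.csv', index=False)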

alphaville76 commented 6 years ago

@bartosh And how do you pass the sid/ticker? Is it the file name? Could you provide some sample data to ingest? And what about the naming for using the data in a Pipeline, for example what does morningstar.valuation_ratios.pe_ratio become using the Quandl dataset?

A useful approach could be the possibility to define aliases for the field names. I'd like to run as much as possible of the same code in zipline-live as in Quantopian, including the names of the fundamental data.

bartosh commented 6 years ago

Yes, the ticker is the file name, e.g. JNJ.csv contains all available fundamental data for JNJ.

Here is an example of earnings/basic share and total assets data for JNJ. I took this data from example datasets on quandl.com:

name,date,value
EPS_MRQ,2017-07-02,1.42
EPS_MRQ,2017-04-02,1.63
EPS_MRQ,2017-01-01,1.46
EPS_MRQ,2016-10-02,1.56
EPS_MRQ,2016-07-03,1.46
EPS_MRQ,2016-04-03,1.62
EPS_MRQ,2016-01-03,1.17
EPS_MRQ,2015-09-27,1.21
EPS_MRQ,2015-06-28,1.63
EPS_MRQ,2015-03-29,1.55
ASSETS_MRQ,2017-07-02,152807000000.0
ASSETS_MRQ,2017-04-02,144918000000.0
ASSETS_MRQ,2017-01-01,141208000000.0
ASSETS_MRQ,2016-10-02,140369000000.0
ASSETS_MRQ,2016-07-03,139814000000.0
ASSETS_MRQ,2016-04-03,136230999999.99998
ASSETS_MRQ,2016-01-03,133410999999.99998
ASSETS_MRQ,2015-09-27,133266000000.0
ASSETS_MRQ,2015-06-28,132036000000.0
ASSETS_MRQ,2015-03-29,128590000000.0
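
For illustration, such a per-ticker file can be read back and reshaped with pandas, e.g.:

import pandas as pd

jnj = pd.read_csv('JNJ.csv', parse_dates=['date'])
# one row per report date, one column per field
wide = jnj.pivot(index='date', columns='name', values='value').sort_index()
ttm_eps = wide['EPS_MRQ'].tail(4).sum()  # trailing-twelve-month EPS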

I don't see any need to define aliases. You can just put whatever name you want into the CSV.

Regarding usage of this data I'm afraid it would require additional code. You can read detailed explanations here: https://www.quantopian.com/posts/possible-to-simulate-inputs-to-pipeline-in-the-research-platform#56e6e2713008a94b5200058f and here https://github.com/quantopian/zipline/issues/911
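
As a rough illustration of the kind of additional code involved, zipline's DataFrameLoader can serve a custom column from a DataFrame. A minimal sketch, assuming a Fundamentals dataset like the one above and a hypothetical forward-filled frame eps_frame (index: trading days, columns: sids):

from zipline.pipeline.loaders.frame import DataFrameLoader

eps_loader = DataFrameLoader(Fundamentals.eps_mrq, eps_frame)

def choose_loader(column):
    # hand the custom loader to the pipeline engine for fundamental
    # columns, and fall back to the usual pricing loader otherwise
    if column in Fundamentals.columns:
        return eps_loader
    return pricing_loader  # hypothetical default USEquityPricing loader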

alphaville76 commented 6 years ago

I see, it's of course a meaningful approach for Pipeline, but it makes it difficult to use the same database to replicate the get_fundamentals SQLAlchemy API.

To leverage the power of SQLAlchemy, my project's data model uses one column per field name and one table per financial statement. Take a look at https://github.com/alphaville76/Fundamentals/blob/master/core/model.py
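
Just to illustrate the idea (this is a sketch, not the actual model.py), such a model could look roughly like this in SQLAlchemy, with one table per financial statement and one column per field:

from sqlalchemy import Column, Date, Float, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class ValuationRatios(Base):
    # one table per financial statement / fundamentals group
    __tablename__ = 'valuation_ratios'
    id = Column(Integer, primary_key=True)
    symbol = Column(String, index=True)
    period = Column(Date, index=True)
    # one column per field
    pe_ratio = Column(Float)
    pb_ratio = Column(Float)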

I know, in this way we are anchored to the M data format, but there are also some advantages, like 100% compatibility with Quantopian, simpler queries and, I believe, also better performance. By the way, gurufocus.com provides M data (fundamentals and daily prices) for a reasonable fee. There is both an XLS export and a JSON API: https://www.gurufocus.com/api.php

I confirm that I am willing to commit to helping develop this feature, but a maintainer of zipline-live should coordinate our efforts and decide what to do.

alphaville76 commented 6 years ago

Well, the get_fundamentals API is now officially deprecated on Quantopian: https://www.quantopian.com/posts/faster-fundamental-data#59a1f4273cdcc8000dca608f

I think we have to concentrate our effort only on the Pipeline API.

@bartosh In my opinion, your approach is fine. The only problem was the integration with get_fundamentals... but now this argument is gone. Could you merge your code, or is there something else that kept you from doing that?

@tibkiss You wrote on the Q forum that M is cheaper than QuantConnect. Did you mean a subscription to the M Premium membership or to the Equity API (http://equityapi.morningstar.com/)?

bartosh commented 6 years ago

@alphaville76 We still need to decide what would be the right way to go. My approach generalises the data source, but it's quite verbose and requires additional code in the algorithm. Yours is bound to one data provider, but allows using Q algos without modifications, at least as far as I understood.

Can we come up with something that is not bound to M* or another provider and still allows using unmodified Q algos?

@tibkiss It would be great to know your opinion on this as well.

bartosh commented 6 years ago

JFYI: Q is going to change fundamentals API: https://www.quantopian.com/posts/faster-fundamental-data

tibkiss commented 6 years ago

Sorry for not coming back earlier on this.

@alphaville76 : I was referring to Morningstar's financials webpage at Quantopian: http://financials.morningstar.com/

To me compatibility is something really important. Extending beyond Q's capabilities is fine, but an algo taken from Q's website should work out of the box.

Can we achieve this by taking Ed's generalized CSV reader and building the pipeline.data.Fundamentals fields dynamically, based on the content of the CSV? With a well-formatted CSV file we could provide the same fields as Q.
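
A minimal sketch of that dynamic construction, assuming all fields are floats and that zipline's DataSet machinery accepts a class built with type() (names are illustrative):

import csv
import numpy as np
from zipline.pipeline.data import Column, DataSet

with open('JNJ.csv') as f:
    field_names = sorted({row['name'] for row in csv.DictReader(f)})

# one pipeline column per distinct field name found in the CSV
Fundamentals = type('Fundamentals', (DataSet,),
                    {name.lower(): Column(np.float64) for name in field_names})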

I'm sorry if this is a bad idea, I did not have time to try out your code.

alphaville76 commented 6 years ago

@tibkiss Thanks for the reply. Compatibility is also very important for me; therefore, as I wrote above, we could abandon my get_fundamentals project because the API will be removed by Q in a month.

I agree with taking Ed's generalized CSV reader and improving his approach. @bartosh What has to be done in order to avoid the additional code in the algorithm?

A big question remains the data. For example, the problem with http://financials.morningstar.com/ is that the "filing date" field is not available, and this is bad for backtesting. Delisted stocks are also not available. Gurufocus has "filing date" and delisted stocks, but the data are restated and not "as reported", making them also unsuitable for backtesting... SHARADAR_SF1 (https://www.quandl.com/data/SF1-Core-US-Fundamentals-Data) has everything but no pricing data for delisted stocks :-(

Anyway, for the beginning and only for live trading, financials.morningstar.com should be okay. I have to check whether it is updated promptly after company filings.

tibkiss commented 6 years ago

Even though I recommended financials.morningstar, I don't think it is the best option. I wasn't able to find a proper API for that site; people are accessing it through web scraping, which may break if they change the page structure.

Regarding SF1: it is really bad that they do not include delisted ones. Nonetheless, for live trading it should be 'okay', as we'll be considering tradable stocks anyway.

@pbharrin : Do you have an opinion about this?

pbharrin commented 6 years ago

@alphaville76 Does SHARADAR_SF1 have any data for delisted stocks? It seems from your comment that they don't have pricing (OHLCV), but we are interested in fundamental data, not pricing. Did I misunderstand your statement? BTW, thanks for your help on fundamentals.

@tibkiss Morningstar doesn't even provide data downloads in .csv?

<rant>I actually don't like Q's approach of putting fundamental data in the pipeline, as you are effectively oversampling the data ~60x. (The fundamentals only change 4x per year but you have to get 252 points.) It would be a moot point if you could just throw hardware at the problem, but it is actually quite limiting. Up until the recent speedup you couldn't get more than a year's worth of fundamental data in an algorithm. I had to do some crazy shit to calculate 5 year growth metrics in Research. </rant>

I see the logic of being compatible with Quantopian, go for it. The interface should be fairly generic so that any source of fundamental data can be used. I like Ed's flat files.

alphaville76 commented 6 years ago

SHARADAR_SF1 has "as reported" fundamentals data also for delisted stocks. The problem is getting the pricing (OHLCV) for symbols that don't exist anymore, and thus being able to perform a full backtest.

For example, the gurufocus API provides both financial and pricing data, also for delisted companies, but unfortunately the financial data are "restated". I wrote to them to ask whether they could extend their service to also offer the data "as reported" (which they're already using for their own backtester).

I also like Ed's flat files: it's the simplest and most flexible form for ingesting the data. And I also don't like Q's approach: for example, for TTM data you have to set window_length=196 and then sum the data at indices 1, 65, 130 and 195. I have an API suggestion: why not add a timeframe to the CustomFactor? For example, to get the last 4 quarters you could simply set window_length=196 and timeframe=quarterly. So we remain compatible with Quantopian (default timeframe=daily), at least in one direction.
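
As a rough sketch of that current workaround (not the proposed timeframe API), assuming a forward-filled quarterly column like the Fundamentals.eps_mrq from the earlier sketches:

from zipline.pipeline import CustomFactor

class TrailingTwelveMonthEPS(CustomFactor):
    inputs = [Fundamentals.eps_mrq]  # hypothetical quarterly column, forward-filled daily
    window_length = 196              # roughly four quarters of trading days

    def compute(self, today, assets, out, eps):
        # pick one daily sample from each quarter (about 65 trading days apart)
        out[:] = eps[1] + eps[65] + eps[130] + eps[195]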

pbharrin commented 6 years ago

@alphaville76 The quantopian-quandl bundle has pricing for delisted tickers. In the past I have paid for the Zacks historical pricing data that also has delisted tickers. (The problem with Zacks is that they don't tell you when dividends are paid.) Can we not get the pricing data from the quantopian-quandl bundle?

alphaville76 commented 6 years ago

@pbharrin Maybe there are some delisted stocks, but I tried some tickers without success. For example FABU (FAB Universal Corp, delisted from the Nasdaq in 2014): it had 3 other tickers in SHARADAR_SF1 (FU, WZD, WZE), but none of them is in the quantopian-quandl bundle.

The same for Yongye International Inc (YONG).

The Zacks ZEP dataset (that costs $300 a quarter) has both: https://www.quandl.com/data/ZEP/FABU-FAB-UNIVL-CP-FABU-Stock-Price https://www.quandl.com/data/ZEP/YONG-Yongye-International-Inc-YONG-Stock-Price-Delisted

alphaville76 commented 6 years ago

As I also wrote on the Google Group, the vendor of SF1 finally provides a dataset with historical prices as well, including delisted companies: https://www.quandl.com/databases/SEP

Using their SF1 and SEP Quandl databases is the lowest-cost solution for reliable backtesting data!

@pbharrin I've only now seen your post "How to Use Fundamental Data With Zipline" (http://alphacompiler.com/blog/6/). Nice job!