yoheikikuta / US-patent-analysis

This is a repository of the analysis of US patent.

Trial and error to understand data #2

Open karino2 opened 6 years ago

karino2 commented 6 years ago

Trial-and-error notes for understanding the data.

May be merged into #1.

karino2 commented 6 years ago

Citation, office action, rejection https://bulkdata.uspto.gov/data/patent/office/actions/bigdata/2017/

These CSV files contain app_id, which is the ID for a patent.

app_id    citation_pat_pgpub_id    parsed    ifw_number    action_type    action_subtype    form892    form1449 citation_in_oa
12000863    5709227    5709227    NaN    NaN    NaN    0

GUESS

karino2 commented 6 years ago

The patent data itself seems to be in

https://bulkdata.uspto.gov/#pats "Patent Grant Full Text Data (No Images) (JAN 1976 - PRESENT)"?

Does doc-number correspond to app_id in citations.csv?

Try downloading https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2018/ and investigate.

karino2 commented 6 years ago

Created a wget script: https://colab.research.google.com/drive/1gJO0VQ72Xfr_SY2dwWyiTvczIv0_gd4L

It calls wget from sh for every zip file. There are 34 zip files, about 100MB each.

The total size of the zip files is about 3.9GB.

Downloading each zip file takes about 2min 20sec.

Total download time: 1h 16min.

After unzipping all files, the total size becomes 23GB.

karino2 commented 6 years ago

Try ag for the first app_id in citations.csv.

> time ag 12000001
...
real    4m22.382s
user    0m21.800s
sys     0m7.736s

Too slow for investigation. Also, ipg*.xml does not contain this ID.

Just taking all of 2018 does not seem to be enough...

karino2 commented 6 years ago

Neither 12000001 nor 20060218340 matches any doc-number in *.xml. What should I do?

karino2 commented 6 years ago

The ID 12000001 means the application was filed in 2007. The rule is described below.

https://patentlibrarian.com/reference-library/u-s-application-series-codes/

Okazawa-san says we should use "application" data instead of "grant" data.

karino2 commented 6 years ago

Try one file from application/2017.

 tail -n 10000 ipa170105.xml | grep "doc-number"
<doc-number>14545869</doc-number>
<doc-number>20170006754</doc-number>
<doc-number>14756094</doc-number>
<doc-number>14-8409</doc-number>
<doc-number>20170006755</doc-number>
<doc-number>14545879</doc-number>
<doc-number>20170006756</doc-number>
<doc-number>14545877</doc-number>

Some doc-numbers seem to match app_id.
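
The grep above can be turned into a one-pass extraction, so that checking a doc-number against app_id becomes a set lookup rather than another full-text search. This is only a minimal sketch: it assumes each `<doc-number>` element sits on a single line, as in the output above. `extract_doc_numbers` and `looks_like_app_id` are hypothetical helper names, and the eight-digit rule is a guess based on the IDs seen so far in this thread.

```python
import re

DOC_NUMBER_RE = re.compile(r"<doc-number>([^<]+)</doc-number>")

def extract_doc_numbers(lines):
    """Yield the text of every <doc-number> element found in the given lines."""
    for line in lines:
        for match in DOC_NUMBER_RE.finditer(line):
            yield match.group(1).strip()

def looks_like_app_id(doc_number):
    """Heuristic (assumption): app_id values in citations.csv are 8 digits."""
    return len(doc_number) == 8 and doc_number.isdigit()

# Three lines copied from the grep output above.
sample = [
    "<doc-number>14545869</doc-number>",
    "<doc-number>20170006754</doc-number>",
    "<doc-number>14-8409</doc-number>",
]
candidates = {d for d in extract_doc_numbers(sample) if looks_like_app_id(d)}
print(candidates)  # {'14545869'}
```

For a real file, the same generator can be fed an open file handle (for example `open("ipa170105.xml")`), which streams line by line instead of loading the whole XML.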

karino2 commented 6 years ago

13992060 seems to appear in both files.

_@instance-ml:~/patent/data$ ag 13992060
citations.csv
41841925:13992060,4725182,4725182,,,,1,0,0
...
41841942:13992060,20080147232,20080147232,,,,0,1,0
41841943:13992060,20110232082,20110232082,IDFY9T5YPXXIFW4,103,a,1,0,1
41841944:13992060,7200922,7200922,IK7E6A9JRXEAPX1,103,a,1,0,1

ipa170105.xml
6585748:<doc-number>13992060</doc-number>
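
To read hits like the ones above more easily, each citations.csv row can be mapped onto the column names listed earlier in this thread. A rough sketch, assuming the CSV column order is the same as in that header listing; `parse_ag_hit` is a hypothetical helper, and the leading number is taken to be the line number that ag prints.

```python
import csv
import io

# Column names as listed earlier in this thread for citations.csv (order assumed).
CITATION_COLUMNS = [
    "app_id", "citation_pat_pgpub_id", "parsed", "ifw_number",
    "action_type", "action_subtype", "form892", "form1449", "citation_in_oa",
]

def parse_ag_hit(hit):
    """Split an ag hit of the form 'LINENO:csv,row' into (line_number, row_dict)."""
    line_number, _, row_text = hit.partition(":")
    values = next(csv.reader(io.StringIO(row_text)))
    return int(line_number), dict(zip(CITATION_COLUMNS, values))

# One hit copied from the ag output above.
line_number, row = parse_ag_hit(
    "41841943:13992060,20110232082,20110232082,IDFY9T5YPXXIFW4,103,a,1,0,1"
)
print(line_number, row["ifw_number"], row["action_type"], row["citation_in_oa"])
```

Rows with no ifw_number (such as the first hit above) come back with empty strings in those fields.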

karino2 commented 6 years ago

What is an office action? https://bulkdata.uspto.gov/data/patent/office/actions/bigdata/2017/USPTO%20Patent%20Prosecution%20Research%20Data_Unlocking%20Office%20Action%20Traits.pdf

karino2 commented 6 years ago

> head office_action.csv
app_id,ifw_number,document_cd,mail_dt,art_unit,uspc_class,uspc_subclass,header_missing,fp_missing,rejection_fp_mismatch,closing_missing,rejection_101,rejection_102,rejection_103,rejection_112,rejection_dp,objection,allowed_claims,cite102_gt1,cite103_gt3,cite103_eq1,cite103_max,signature_type
14150981,100867762,CTFR,2015-10-15,2632,375,219000,0,0,0,0,0,0,1,0,0,0,0,0,0,1,2,1
14198961,100867788,CTFR,2015-10-15,2699,345,173000,0,0,0,0,0,1,1,0,0,0,0,0,0,0,2,1
13796589,100867794,CTNF,2015-10-15,3776,606,159000,0,0,0,0,0,0,1,0,0,0,0,0,0,0,3,3

101 Rejection – Subject Matter Eligibility, Statutory Double Patenting, Utility, etc.
102 Rejection – Lack of Novelty
103 Rejection – Obviousness

karino2 commented 6 years ago

The office action data contains ifw_number, which is the document ID of the office action. The citation data also contains ifw_number, which identifies the document in which the citation occurs.

Also, the patent ID prefix is assigned based on the application date. https://patentlibrarian.com/reference-library/u-s-application-series-codes/

For example, an app_id starting with 13 means the application was filed during Dec. 2010 – Aug. 2013.
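
The prefix rule can be sketched as a small helper that splits an app_id into its two-digit series code and the remaining serial number. `split_app_id` and `format_app_id` are hypothetical names, and the code assumes every app_id is exactly eight digits, which holds for the IDs seen in this thread but might not hold for older series codes with a leading zero.

```python
def split_app_id(app_id):
    """Split an 8-digit app_id into (series_code, serial_number) strings."""
    text = str(app_id)
    if len(text) != 8 or not text.isdigit():
        raise ValueError(f"expected an 8-digit application number, got {app_id!r}")
    return text[:2], text[2:]

def format_app_id(app_id):
    """Format an app_id in the 'series/serial' style, e.g. 13/992060."""
    series, serial = split_app_id(app_id)
    return f"{series}/{serial}"

print(format_app_id(13992060))  # 13/992060
```

Grouping rows by the first element of `split_app_id` gives a quick view of which filing periods a CSV covers.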

yoheikikuta commented 6 years ago

app_id, USPTO Series Codes (up to 2013) https://patentlibrarian.com/reference-library/u-s-application-series-codes/

e.g., app_id 12xxxxxx is from Dec. 6, 2007 to Mar. 2011.

yoheikikuta commented 6 years ago

doc-number patterns:

karino2 commented 6 years ago

Downloaded all grants for 2012. https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2012/

Took 95min. Size is 4.4GB before decompression.

karino2 commented 6 years ago

app_id prefix

karino2 commented 6 years ago

Patent numbers published in 2012 are in the range of about 8090000 - 8340000.

Application:
app_id filed on 2015-06-25 is 14/750000
app_id filed on 2016-06-01 is 15/170000

These give the rough range for publications between 2017-01-25 and 2017-12-01 (year 2017).

14/730000-15/224600 were probably published between 2016-12-03 and 2018-01-31.

karino2 commented 6 years ago

2011 grants: 7862000 - 8087000
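
The rough grant-number ranges noted in this thread can be used as a quick check on whether a cited patent number falls in a year that has already been downloaded. This is only a sketch: the boundaries are the approximate values quoted here, not official cutoffs, and `guess_grant_year` is a hypothetical helper.

```python
# Rough ranges quoted in this thread; boundaries are approximate.
GRANT_RANGES = {
    2011: (7862000, 8087000),
    2012: (8090000, 8340000),
}

def guess_grant_year(patent_number):
    """Return the year whose rough range contains patent_number, else None."""
    for year, (low, high) in GRANT_RANGES.items():
        if low <= patent_number <= high:
            return year
    return None

print(guess_grant_year(8100000))  # 2012
print(guess_grant_year(7200922))  # None (outside both ranges)
```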

karino2 commented 6 years ago

doc-number of grants

karino2 commented 6 years ago

  1. Merge office_actions.csv and citations.csv
  2. Filter by grant2012_all.csv and application2017_all.csv

6745 rows, 5424 unique parsed, 3102 unique app_id
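
The two steps above might look roughly like the following. This is a sketch under several assumptions that the thread does not confirm: the join key is taken to be (app_id, ifw_number), the filter is read as "app_id is in the 2017 applications and the parsed citation is in the 2012 grants", and both ID lists are already loaded as sets of strings. `join_and_filter` and `summarize` are hypothetical names.

```python
def join_and_filter(citations, office_actions, application_ids, grant_numbers):
    """Inner-join citations with office actions, then keep rows in both ID sets.

    citations / office_actions: iterables of dicts (e.g. from csv.DictReader).
    application_ids / grant_numbers: sets of strings.
    """
    actions_by_key = {}
    for action in office_actions:
        actions_by_key[(action["app_id"], action["ifw_number"])] = action

    joined = []
    for citation in citations:
        key = (citation["app_id"], citation["ifw_number"])
        action = actions_by_key.get(key)
        if action is None:
            continue  # citation is not tied to a known office action
        if citation["app_id"] not in application_ids:
            continue
        if citation["parsed"] not in grant_numbers:
            continue
        joined.append({**action, **citation})
    return joined

def summarize(rows):
    """Count rows, unique parsed citations, and unique app_id values."""
    return {
        "rows": len(rows),
        "unique_parsed": len({r["parsed"] for r in rows}),
        "unique_app_id": len({r["app_id"] for r in rows}),
    }

# Made-up rows purely to show the call shape; not real USPTO data.
citations = [
    {"app_id": "A1", "ifw_number": "F1", "parsed": "P1"},
    {"app_id": "A1", "ifw_number": "F1", "parsed": "P2"},
    {"app_id": "A2", "ifw_number": "F2", "parsed": "P1"},
]
office_actions = [{"app_id": "A1", "ifw_number": "F1", "rejection_103": "1"}]
rows = join_and_filter(citations, office_actions, {"A1", "A2"}, {"P1"})
print(summarize(rows))
```

`summarize` reports the same three counts as the line above (rows, unique parsed, unique app_id), so the result of a different join key or filter can be compared directly.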

karino2 commented 6 years ago

3102 XML files were retrieved. There is a one-to-one relation between the df and the retrieved XML.

karino2 commented 6 years ago

Retrieved 3083 XML files. The number of unique app_id values is 3077. Some share the same app_id with different XML (as we expected). 25 applications seem to be lost from the above df.

Lost app_id:

{13687937,
 13757778,
 14046229,
 14223855,
 14494367,
 14494415,
 14577580,
 14864955,
 14872276,
 14872315,
 14872716,
 14873160,
 14873292,
 14873845,
 14874591,
 14874615,
 14875024,
 14876632,
 14896485,
 15127826,
 15144598,
 15200023,
 15207409,
 15208151,
 15331960}
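
A list like the one above can be produced with a plain set difference between the app_id values expected from the df and those actually found in the retrieved XML. A minimal sketch; `find_lost_ids` is a hypothetical helper, and the `retrieved` set below is made up for illustration.

```python
def find_lost_ids(expected_ids, retrieved_ids):
    """Return the sorted IDs present in expected_ids but absent from retrieved_ids."""
    return sorted(set(expected_ids) - set(retrieved_ids))

# Three IDs taken from the list above; the 'retrieved' set is made up.
expected = {13687937, 13757778, 14046229}
retrieved = {13757778}
print(find_lost_ids(expected, retrieved))  # [13687937, 14046229]
```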

karino2 commented 6 years ago