Open karino2 opened 6 years ago
Citation, office action, rejection https://bulkdata.uspto.gov/data/patent/office/actions/bigdata/2017/
These CSV files contain app_id, which is the ID of the patent application.
app_id citation_pat_pgpub_id parsed ifw_number action_type action_subtype form892 form1449 citation_in_oa
12000863 5709227 5709227 NaN NaN NaN 0
GUESS
The patent data itself seems to be in
https://bulkdata.uspto.gov/#pats "Patent Grant Full Text Data (No Images) (JAN 1976 - PRESENT)" ?
doc-number seems to correspond to app_id in citations.csv?
Try getting https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2018/ and investigate.
Create a wget script: https://colab.research.google.com/drive/1gJO0VQ72Xfr_SY2dwWyiTvczIv0_gd4L
and call wget for every zip file from sh. Each zip file is about 100MB. There are 34 zip files in total.
Total Zip files size is about 3.9GB.
Downloading each zip file takes about 2min 20sec.
Total download time: 1h 16min.
Unzipping all the files brings the total size to 23GB.
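The download step above could be sketched in Python roughly as follows. This is only an illustration, not the script in the Colab notebook: it assumes the directory page is plain HTML with one `<a href="...zip">` link per file, and the names `ZipLinkParser`, `zip_urls` and `wget_script` are made up for this note.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ZipLinkParser(HTMLParser):
    """Collect href values that end in .zip from an HTML index page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().endswith(".zip"):
                self.links.append(value)


def zip_urls(index_html, base_url):
    """Return absolute URLs of all .zip links found in index_html."""
    parser = ZipLinkParser()
    parser.feed(index_html)
    return [urljoin(base_url, href) for href in parser.links]


def wget_script(urls):
    """Build a shell script with one resumable (-c) wget call per URL."""
    lines = ["#!/bin/sh"]
    lines += [f"wget -c '{url}'" for url in urls]
    return "\n".join(lines) + "\n"
```

The generated text can be saved to a file and run with sh. Using `wget -c` lets an interrupted download resume, which helps when the whole run takes over an hour.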
Try ag for the first app_id in citations.csv.
> time ag 12000001
...
real 4m22.382s
user 0m21.800s
sys 0m7.736s
Too slow for investigation. Also, ipg*.xml does not contain this id.
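Since a full-text search over 23GB takes minutes per id, one option is to scan the files once and keep only the doc-number values. A minimal sketch, assuming each `<doc-number>` element sits on a single line (as in the grep output in these notes). The function name `build_doc_number_index` is hypothetical, and the index for a whole year may itself need a lot of memory.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches one doc-number element; assumes it does not span lines.
DOC_NUMBER = re.compile(r"<doc-number>([^<]+)</doc-number>")


def build_doc_number_index(xml_dir, pattern="*.xml"):
    """Map each doc-number string to the set of file names containing it."""
    index = defaultdict(set)
    for path in sorted(Path(xml_dir).glob(pattern)):
        # Read line by line so a multi-GB file is never loaded at once.
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                for number in DOC_NUMBER.findall(line):
                    index[number.strip()].add(path.name)
    return index
```

After the one-time scan, a check such as `"12000001" in index` is a dictionary lookup instead of another pass over the disk.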
Just taking all of 2018 does not seem to be enough...
Neither 12000001 nor 20060218340 matches any doc-number in *.xml. What should I do?
The ID 12000001 means the application was filed in 2007. The rule is below.
https://patentlibrarian.com/reference-library/u-s-application-series-codes/
Okazawa-san says we should use "application" data instead of "grant" data.
Try one file of application/2017.
tail -n 10000 ipa170105.xml | grep "doc-number"
<doc-number>14545869</doc-number>
<doc-number>20170006754</doc-number>
<doc-number>14756094</doc-number>
<doc-number>14-8409</doc-number>
<doc-number>20170006755</doc-number>
<doc-number>14545879</doc-number>
<doc-number>20170006756</doc-number>
<doc-number>14545877</doc-number>
Some doc-numbers seem to match app_id.
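The doc-number values listed above come in several shapes, so a rough classifier can help decide which ones are worth comparing with app_id. This is a guess based only on the shape of the values seen in this file (8 digits, 11 digits starting with a year, anything else), not an official rule, and `classify_doc_number` is a hypothetical helper.

```python
import re


def classify_doc_number(doc_number):
    """Guess the kind of identifier from the shape of a doc-number value."""
    if re.fullmatch(r"\d{8}", doc_number):
        return "application"   # e.g. 14545869, same shape as app_id
    if re.fullmatch(r"(19|20)\d{9}", doc_number):
        return "publication"   # e.g. 20170006754, year plus 7 digits
    return "other"             # e.g. 14-8409
```

Only the values classified as "application" would then be looked up in citations.csv.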
13992060 appears in both files.
_@instance-ml:~/patent/data$ ag 13992060
citations.csv
41841925:13992060,4725182,4725182,,,,1,0,0
...
41841942:13992060,20080147232,20080147232,,,,0,1,0
41841943:13992060,20110232082,20110232082,IDFY9T5YPXXIFW4,103,a,1,0,1
41841944:13992060,7200922,7200922,IK7E6A9JRXEAPX1,103,a,1,0,1
ipa170105.xml
6585748:<doc-number>13992060</doc-number>
> head office_action.csv
app_id,ifw_number,document_cd,mail_dt,art_unit,uspc_class,uspc_subclass,header_missing,fp_missing,rejection_fp_mismatch,closing_missing,rejection_101,rejection_102,rejection_103,rejection_112,rejection_dp,objection,allowed_claims,cite102_gt1,cite103_gt3,cite103_eq1,cite103_max,signature_type
14150981,100867762,CTFR,2015-10-15,2632,375,219000,0,0,0,0,0,0,1,0,0,0,0,0,0,1,2,1
14198961,100867788,CTFR,2015-10-15,2699,345,173000,0,0,0,0,0,1,1,0,0,0,0,0,0,0,2,1
13796589,100867794,CTNF,2015-10-15,3776,606,159000,0,0,0,0,0,0,1,0,0,0,0,0,0,0,3,3
101 Rejection – Subject Matter Eligibility, Statutory Double Patenting, Utility, etc
102 Rejection – Lack of Novelty
103 Rejection – Obviousness
Each office action has an ifw_number, which is the document ID of that office action. Each citation also has an ifw_number, which identifies the document in which the citation occurs.
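The link through ifw_number can be sketched with the standard csv module. The column names are the ones in the CSV headers shown in these notes. The function name `citations_by_office_action` and the shape of the result are assumptions made for illustration, and citation rows with an empty ifw_number are simply skipped.

```python
import csv
from collections import defaultdict


def citations_by_office_action(office_action_file, citations_file):
    """Attach cited document ids to office actions through ifw_number."""
    cited = defaultdict(list)
    for row in csv.DictReader(citations_file):
        if row["ifw_number"]:  # skip citations with no ifw_number
            cited[row["ifw_number"]].append(row["citation_pat_pgpub_id"])

    result = {}
    for row in csv.DictReader(office_action_file):
        result[row["ifw_number"]] = {
            "app_id": row["app_id"],
            "rejection_103": row["rejection_103"],
            "citations": cited.get(row["ifw_number"], []),
        }
    return result
```

Both arguments are open file objects, for example `open("office_action.csv", newline="")` and `open("citations.csv", newline="")`. For the full files a pandas merge on ifw_number would probably be more practical.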
Also, the app_id prefix (series code) is assigned based on when the application was filed. https://patentlibrarian.com/reference-library/u-s-application-series-codes/
For example, an app_id starting with 13 means the application was filed between Dec. 2010 and Aug. 2013.
app_id, USPTO Series Codes (up to 2013) https://patentlibrarian.com/reference-library/u-s-application-series-codes/
e.g., an app_id of the form 12xxxxxx was filed between Dec. 6, 2007 and Mar. 2011.
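Splitting an app_id into its series code and serial number is a simple string operation. A small sketch, assuming app_id is always an 8-digit number whose first two digits are the series code. The helper names `split_app_id` and `format_app_id` are made up here.

```python
def split_app_id(app_id):
    """Split an 8-digit application number into (series code, serial number)."""
    digits = str(app_id).zfill(8)  # keep a leading zero, e.g. series 09
    if len(digits) != 8 or not digits.isdigit():
        raise ValueError(f"not an 8-digit application number: {app_id!r}")
    return digits[:2], digits[2:]


def format_app_id(app_id):
    """Render an application number in series/serial form."""
    series, serial = split_app_id(app_id)
    return f"{series}/{serial}"
```

For example, `format_app_id(13992060)` gives `13/992060`, and the series code `13` can then be looked up in the series code table linked above.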
doc-number patterns:
Download all grants from 2012. https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2012/
Takes 95min. Size is 4.4GB before decompression.
app_id prefix
Patent numbers published in 2012 are roughly in the range 8090000 - 8340000.
Application:
The app_id filed on 2015-06-25 is 14/750000.
The app_id filed on 2016-06-01 is 15/170000.
This is the rough range of applications published between 2017-01-25 and 2017-12-01 (year 2017).
14/730000 - 15/224600 were probably published between 2016-12-03 and 2018-01-31.
2011 grants: 7862000 - 8087000
doc-number of grants
6745 rows, 5424 unique parsed values, 3102 unique app_id values.
3102 XML files are retrieved. The df and the retrieved XML have a one-to-one relation.
Retrieved 3083 XML files. The number of unique app_id values is 3077. Some XML files share the same app_id (as we expected). 25 applications seem to be lost from the df above.
lost app_id
{13687937,
13757778,
14046229,
14223855,
14494367,
14494415,
14577580,
14864955,
14872276,
14872315,
14872716,
14873160,
14873292,
14873845,
14874591,
14874615,
14875024,
14876632,
14896485,
15127826,
15144598,
15200023,
15207409,
15208151,
15331960}
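A lost list like the one above can be computed as a set difference between the app_id values in the df and those found in the retrieved XML. A minimal sketch with a hypothetical helper name, where both inputs are any iterables of app_id values (int or str):

```python
def lost_app_ids(expected_app_ids, retrieved_app_ids):
    """Return app_ids that were expected but have no retrieved XML, sorted."""
    expected = {str(a) for a in expected_app_ids}
    retrieved = {str(a) for a in retrieved_app_ids}
    return sorted(expected - retrieved)
```

Converting everything to str avoids silent mismatches when one side was read from CSV as integers and the other was parsed from XML as text.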
Trial-and-error notes for understanding the data.
Maybe merge into #1.