Coverage of Crossref in Wikidata

WikiCite relies on Wikidata as its source for collating sources, so it would be interesting to know how much of Crossref is currently covered by Wikidata.

Of the 134 million DOIs in the dump, we filtered the dataset by type of record to rule out things that are unlikely to be used as sources, such as journal issues and book series. Keeping only ["journal-article", "book-chapter", "proceedings-article", "dissertation", "book", "monograph", "dataset"] left us with 112 million (112,013,354) DOIs.

Using a sample size calculator, we find that we only need to sample 16.5k DOIs to estimate what percentage of the general population can be found in Wikidata at 99% confidence with a 1% margin of error. However, as my script is quick enough, I checked 1% of the list, or 1.1 million DOIs. I created my sample using `shuf -n 1120133 input.txt > output.txt`, as this was recommended on StackOverflow, and then ran my ID to QID script, which processed all 1.1 million DOIs in 50 minutes.

The script works as follows. The list of DOIs is read into a Python list and split into pages of 125k items; we split into pages so that a malformed item does not cause the entire run to fail. For each page we iterate through it in 100-item chunks, which are inserted into a SPARQL query like so (only 10 DOIs shown for readability; sketches of each step follow the query):
Use at https://query.wikidata.org/sparql
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item ?id ?license {
  VALUES ?id { '10.4067/S0717-95022018000401439' '10.1111/J.1471-0528.1982.TB05083.X'
               '10.1016/S1297-9589(03)00146-2' '10.17116/ONKOLOG2020905145' '10.1177/1368430299024002' '10.1109/TNET.2008.2011734'
               '10.32838/TNU-2663-5941/2020.6-2/03' '10.1039/C4RA01952K' '10.7717/PEERJ.11233' '10.1016/J.ACTAASTRO.2014.11.037' }
  ?item wdt:P356 ?id .
  OPTIONAL { ?item wdt:P275 ?license }
}
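Taking the steps in order: the type filtering isn't shown in the post, but a minimal sketch might look like the following. The file layout (gzipped JSON files with records under an "items" key, as in recent Crossref public data files) and all paths are assumptions, not a description of the exact pipeline used here.

```python
import glob
import gzip
import json

# Record types likely to be used as sources; everything else is dropped.
KEEP = {"journal-article", "book-chapter", "proceedings-article",
        "dissertation", "book", "monograph", "dataset"}

# Hypothetical paths: adjust to wherever the dump is unpacked.
with open("input.txt", "w") as out:
    for path in glob.glob("crossref-dump/*.json.gz"):
        with gzip.open(path, "rt") as f:
            for record in json.load(f)["items"]:
                if record.get("type") in KEEP:
                    out.write(record["DOI"] + "\n")
```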
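The 16.5k figure from the sample size calculator can likewise be sanity-checked with the standard formula for estimating a proportion, n0 = z^2 * p(1 - p) / e^2, plus a finite population correction. This is just a quick check, not the calculator linked above:

```python
import math

def sample_size(population, z=2.576, margin=0.01, p=0.5):
    """Sample size needed to estimate a proportion.

    z = 2.576 is the normal quantile for 99% confidence,
    margin is the margin of error, and p = 0.5 is the
    worst-case (most conservative) population proportion.
    """
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Finite population correction barely matters at 112M.
    return math.ceil(n0 / (1 + (n0 - 1) / population))

print(sample_size(112_013_354))  # -> 16587, i.e. the ~16.5k quoted above
```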
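Finally, a minimal sketch of the paging-and-chunking lookup described above, assuming the SPARQLWrapper library and the public endpoint; the names are my own and this is not the actual ID to QID script:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://query.wikidata.org/sparql"
# Doubled braces escape str.format; {dois} is filled per chunk.
QUERY = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item ?id ?license {{
  VALUES ?id {{ {dois} }}
  ?item wdt:P356 ?id .
  OPTIONAL {{ ?item wdt:P275 ?license }}
}}
"""

def chunks(items, size):
    """Yield successive size-length slices of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def lookup_page(page):
    """Resolve one page of DOIs to QIDs, 100 DOIs per query."""
    sparql = SPARQLWrapper(ENDPOINT, agent="doi-coverage-check/0.1")
    sparql.setReturnFormat(JSON)
    found = {}
    for chunk in chunks(page, 100):
        # P356 values are stored uppercase on Wikidata.
        values = " ".join("'%s'" % doi.upper() for doi in chunk)
        sparql.setQuery(QUERY.format(dois=values))
        for row in sparql.query().convert()["results"]["bindings"]:
            found[row["id"]["value"]] = row["item"]["value"]
    return found

with open("output.txt") as f:
    dois = [line.strip() for line in f if line.strip()]

hits = 0
# Pages isolate failures: a malformed DOI only loses its own page.
for page in chunks(dois, 125_000):
    try:
        hits += len(lookup_page(page))
    except Exception as err:
        print("page failed:", err)
print(f"{hits}/{len(dois)} DOIs found in Wikidata")
```

Batching 100 DOIs per VALUES clause keeps each request comfortably within the endpoint's limits, while the 125k pages bound the blast radius of a malformed item to a single page rather than the whole run.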