Coverage of Crossref in Wikidata

WikiCite relies on Wikidata as its source for collating sources, so it would be interesting to know how much of Crossref is currently covered by Wikidata.

Of the 134 million DOIs in the dump, we filtered the dataset by type of record to rule out things that are unlikely to be used as sources, such as journal issues and book series. Keeping only ["journal-article", "book-chapter", "proceedings-article", "dissertation", "book", "monograph", "dataset"] left us with 112 million (112,013,354) DOIs.

Using a sample size calculator, we find that we only need to sample 16.5k DOIs to estimate what percentage of the general population can be found in Wikidata at 99% confidence with a 1% margin of error. However, as my script is quick enough, I checked 1% of the list, or 1.1 million DOIs. I created my sample using `shuf -n 1120133 input.txt > output.txt`, as this was recommended on StackOverflow, and then ran my ID to QID script, which processed all 1.1 million DOIs in 50 minutes.

The script works as follows. The list of DOIs is read into a Python list and split into pages of 125k items; we split into pages so that a malformed item does not cause the entire run to fail. For each page we iterate through it in 100-item chunks, which are inserted into a SPARQL query like so (only 10 DOIs shown for readability; sketches of each step follow the query):
Use at https://query.wikidata.org/sparql
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item ?id ?license {
  VALUES ?id { '10.4067/S0717-95022018000401439' '10.1111/J.1471-0528.1982.TB05083.X'
               '10.1016/S1297-9589(03)00146-2' '10.17116/ONKOLOG2020905145' '10.1177/1368430299024002' '10.1109/TNET.2008.2011734'
               '10.32838/TNU-2663-5941/2020.6-2/03' '10.1039/C4RA01952K' '10.7717/PEERJ.11233' '10.1016/J.ACTAASTRO.2014.11.037' }
  ?item wdt:P356 ?id .
  OPTIONAL { ?item wdt:P275 ?license }
}
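Taking the steps in order: the type filtering isn't shown in the post, but a minimal sketch might look like the following. The file layout (gzipped JSON files with records under an "items" key, as in recent Crossref public data files) and all paths are assumptions, not a description of the exact pipeline used here.

```python
import glob
import gzip
import json

# Record types likely to be used as sources; everything else is dropped.
KEEP = {"journal-article", "book-chapter", "proceedings-article",
        "dissertation", "book", "monograph", "dataset"}

# Hypothetical paths: adjust to wherever the dump is unpacked.
with open("input.txt", "w") as out:
    for path in glob.glob("crossref-dump/*.json.gz"):
        with gzip.open(path, "rt") as f:
            for record in json.load(f)["items"]:
                if record.get("type") in KEEP:
                    out.write(record["DOI"] + "\n")
```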
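The 16.5k figure from the sample size calculator can likewise be sanity-checked with the standard formula for estimating a proportion, n0 = z^2 * p(1 - p) / e^2, plus a finite population correction. This is just a quick check, not the calculator linked above:

```python
import math

def sample_size(population, z=2.576, margin=0.01, p=0.5):
    """Sample size needed to estimate a proportion.

    z = 2.576 is the normal quantile for 99% confidence,
    margin is the margin of error, and p = 0.5 is the
    worst-case (most conservative) population proportion.
    """
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Finite population correction barely matters at 112M.
    return math.ceil(n0 / (1 + (n0 - 1) / population))

print(sample_size(112_013_354))  # -> 16587, i.e. the ~16.5k quoted above
```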
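Finally, a minimal sketch of the paging-and-chunking lookup described above, assuming the SPARQLWrapper library and the public endpoint; the names are my own and this is not the actual ID to QID script:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://query.wikidata.org/sparql"
# Doubled braces escape str.format; {dois} is filled per chunk.
QUERY = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item ?id ?license {{
  VALUES ?id {{ {dois} }}
  ?item wdt:P356 ?id .
  OPTIONAL {{ ?item wdt:P275 ?license }}
}}
"""

def chunks(items, size):
    """Yield successive size-length slices of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def lookup_page(page):
    """Resolve one page of DOIs to QIDs, 100 DOIs per query."""
    sparql = SPARQLWrapper(ENDPOINT, agent="doi-coverage-check/0.1")
    sparql.setReturnFormat(JSON)
    found = {}
    for chunk in chunks(page, 100):
        # P356 values are stored uppercase on Wikidata.
        values = " ".join("'%s'" % doi.upper() for doi in chunk)
        sparql.setQuery(QUERY.format(dois=values))
        for row in sparql.query().convert()["results"]["bindings"]:
            found[row["id"]["value"]] = row["item"]["value"]
    return found

with open("output.txt") as f:
    dois = [line.strip() for line in f if line.strip()]

hits = 0
# Pages isolate failures: a malformed DOI only loses its own page.
for page in chunks(dois, 125_000):
    try:
        hits += len(lookup_page(page))
    except Exception as err:
        print("page failed:", err)
print(f"{hits}/{len(dois)} DOIs found in Wikidata")
```

Batching 100 DOIs per VALUES clause keeps each request comfortably within the endpoint's limits, while the 125k pages bound the blast radius of a malformed item to a single page rather than the whole run.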