qwikidata.json_dump module

Module for Wikidata JSON dumps.

class WikidataJsonDump(filename)

Bases: object

Class for Wikidata JSON dump files.

Represents a JSON file from https://dumps.wikimedia.org/wikidatawiki/entities. File names have the form “wikidata-YYYYMMDD-all.json[.bz2|.gz]”. The file is a single JSON array with one element (an item or a property) per line; the first and last lines are the opening and closing square brackets. This class handles bz2- and gz-compressed files as well as uncompressed JSON files.

Parameters

filename (str) – Path to the Wikidata JSON dump file (e.g. my_data_dir/wikidata-20180730-all.json.bz2)
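The line layout described above (bracket lines first and last, one element per line) can be parsed with the standard library alone. The following is a minimal sketch, not the class's own implementation: the helper name iter_entities and the sample data are hypothetical, and it assumes that elements are separated by a trailing comma at the end of each line.

```python
import json
from typing import Iterable, Iterator


def iter_entities(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one entity dict per line of a line-per-element JSON array."""
    for line in lines:
        # Drop surrounding whitespace and the comma that separates elements
        # (assumption: the separator sits at the end of the line).
        text = line.strip().rstrip(",")
        # Skip the opening and closing bracket lines, and any blank lines.
        if text in ("[", "]", ""):
            continue
        yield json.loads(text)


# Tiny in-memory stand-in for a dump file (hypothetical data).
sample_lines = [
    "[\n",
    '{"id": "Q1", "type": "item"},\n',
    '{"id": "P1", "type": "property"}\n',
    "]\n",
]

entities = list(iter_entities(sample_lines))
print([entity["id"] for entity in entities])  # prints ['Q1', 'P1']
```

Parsing line by line keeps memory use bounded by the largest single entity, which matters because the full array is far too large to load with one json.load call.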

create_chunks(out_fbase=None, num_lines_per_chunk=100, max_chunks=10000000000)

Split the dump into files of num_lines_per_chunk Wikidata entities each, writing at most max_chunks files.

Parameters
  • out_fbase (str) – Base name for the output files. Each output file name has the form {out_fbase}_ichunk_{ichunk}.json[.bz2|.gz]

  • num_lines_per_chunk (int) – Number of lines per chunk file

  • max_chunks (int) – Maximum number of chunks to write

Return type

List[str]
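To illustrate how these parameters interact, here is a simplified, standard-library sketch of the chunking idea. It is not the library's implementation: the function write_chunks is hypothetical, it writes uncompressed files only, it does not add bracket lines to each chunk, and it assumes that chunk indices start at 0 and that the returned list holds the output file names.

```python
import os
import tempfile
from typing import Iterable, List


def write_chunks(
    lines: Iterable[str],
    out_fbase: str,
    num_lines_per_chunk: int = 100,
    max_chunks: int = 10_000_000_000,
) -> List[str]:
    """Write lines into numbered chunk files and return the file names."""
    out_fnames: List[str] = []
    buffer: List[str] = []

    def flush() -> None:
        # Name each file after its chunk index (assumed to be zero-based).
        fname = f"{out_fbase}_ichunk_{len(out_fnames)}.json"
        with open(fname, "w", encoding="utf-8") as fp:
            fp.writelines(buffer)
        out_fnames.append(fname)
        buffer.clear()

    for line in lines:
        # Stop reading once the chunk limit has been reached.
        if len(out_fnames) >= max_chunks:
            break
        buffer.append(line)
        if len(buffer) == num_lines_per_chunk:
            flush()

    # Write any remaining lines as a final, shorter chunk.
    if buffer and len(out_fnames) < max_chunks:
        flush()
    return out_fnames


with tempfile.TemporaryDirectory() as tmp_dir:
    sample = [f'{{"id": "Q{i}"}}\n' for i in range(5)]
    chunk_files = write_chunks(
        sample, os.path.join(tmp_dir, "demo"), num_lines_per_chunk=2
    )
    chunk_names = [os.path.basename(name) for name in chunk_files]

print(chunk_names)
```

With five input lines and two lines per chunk, the sketch writes three files, the last holding the single leftover line. Setting a small max_chunks is a convenient way to produce a test-sized sample from a very large dump.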

iter_lines()

Yield the lines of the JSON dump file one at a time.

Return type

Iterator[str]
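A line iterator over plain, bz2, and gz files can be sketched with the standard library by choosing an opener from the file extension. This is an illustration only, under the assumption that the compression type is inferred from the file name; iter_dump_lines is a hypothetical helper, not the method's actual source, and whether the real method keeps trailing newlines is not stated here.

```python
import bz2
import gzip
import os
import tempfile
from typing import Iterator


def iter_dump_lines(filename: str) -> Iterator[str]:
    """Yield text lines from a plain, bz2-compressed, or gz-compressed file."""
    # Pick the opener from the extension (assumption for this sketch).
    if filename.endswith(".bz2"):
        open_func = bz2.open
    elif filename.endswith(".gz"):
        open_func = gzip.open
    else:
        open_func = open
    # Mode "rt" makes all three openers return decoded text lines.
    with open_func(filename, mode="rt", encoding="utf-8") as fp:
        yield from fp


# Write a tiny gz file (hypothetical data) and read it back line by line.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tiny-all.json.gz")
    with gzip.open(path, mode="wt", encoding="utf-8") as fp:
        fp.write('[\n{"id": "Q1"}\n]\n')
    dump_lines = list(iter_dump_lines(path))

print(len(dump_lines))  # prints 3
```

Because the function is a generator, the file is decompressed incrementally as lines are consumed, so a multi-gigabyte dump never has to fit in memory.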