Trove journals

Trove's 'journals' zone includes journals and journal articles, as well as other research outputs and things like press releases. You can access metadata from the journals zone through the Trove API.

Tips, tools, and examples

  • Create a list of Trove's digitised journals
    Everyone knows about Trove's newspapers, but there is also a growing collection of digitised journals available in the journals zone. They're not easy to find, however, which is why I created the Trove Titles web app. This notebook uses the Trove API to harvest metadata relating to digitised journals – or more accurately, journals that are freely available online in a digital form. This includes some born-digital publications that are available to view in formats like PDF and MOBI, but excludes some digital journals that have access restrictions.

  • Get OCRd text from a digitised journal in Trove
    Many of the digitised journals available in Trove make OCRd text available for download – one text file for each journal issue. However, while there are records for journals and articles in Trove (and available through the API), there are no records for issues. So how do we find them? This notebook shows how to extract issue data from a digitised journal and download OCRd text for each issue.

  • Download the OCRd text for ALL the digitised journals in Trove!
    Using the code and data from the previous two notebooks, you can download the OCRd text from every digitised journal. If you're going to try this, you'll need lots of patience and lots of disk space. Needless to say, don't try this on a cloud service like Binder. Fortunately, you don't have to do it yourself, as I've already run the harvest and made all the text files available. See below for details. I repeat, you probably don't want to do this yourself. The point of this notebook is really to document the methodology used to create the repository.

  • Harvest parliament press releases from Trove
    Trove includes more than 370,000 press releases, speeches, and interview transcripts issued by Australian federal politicians and saved by the Parliamentary Library. You can view them all in Trove by searching for nuc:"APAR:PR" in the journals zone. This notebook shows you how to harvest both metadata and full text from a search of the parliamentary press releases. The metadata is available from Trove, but to get the full text we have to go back to the Parliamentary Library's database, ParlInfo.

  • Harvesting data from the Bulletin
    This is a more specific example of harvesting metadata and OCRd text from a digitised journal, in this case The Bulletin. It also shows how you can get the front cover images (or any other page).

  • Finding editorial cartoons in the Bulletin
    In another notebook I showed how you could download all the front pages of The Bulletin (and other journals) as images. Amongst the front pages you'll find a number of full page editorial cartoons under The Bulletin's masthead. But you'll also find that many of the front pages are advertising wrap arounds. The full page editorial cartoons were a consistent feature of The Bulletin for many decades, but they moved around between pages one and eleven. That makes them hard to find. I wanted to try and assemble a collection of all the editorial cartoons, but how?

  • Harvesting data from Home
    This is a more specific example of harvesting metadata and OCRd text from a digitised journal, in this case The Home. It also shows how you can get the front cover images (or any other page).

Data and text

  • CSV formatted list of journals freely available from Trove in digital form (21 April 2019)
    This file provides metadata for 2,024 journals that are freely available from Trove in a digital form. You can download the CSV file.

    This file includes the following columns:

    • fulltext_url – the url of the landing page of the digital version of the journal
    • title – the title of the journal
    • trove_id – the 'nla.obj' part of the fulltext_url, a unique identifier for the digital journal
    • trove_url – url of the journal's metadata record in Trove
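
The relationship between fulltext_url and trove_id described above can be illustrated with a small sketch. It assumes the identifier always looks like 'nla.obj-' followed by digits, which matches the examples on this page but is not something the CSV documentation guarantees. The function name extract_trove_id is hypothetical.

```python
import re

# Assumption: a Trove digital object identifier is 'nla.obj-' followed by digits.
NLA_OBJ_PATTERN = re.compile(r"nla\.obj-\d+")


def extract_trove_id(fulltext_url):
    """Return the 'nla.obj-...' identifier found in a URL, or None if absent."""
    match = NLA_OBJ_PATTERN.search(fulltext_url)
    return match.group(0) if match else None


# Example using an identifier URL of the form shown elsewhere on this page.
print(extract_trove_id("https://nla.gov.au/nla.obj-606969767"))
```

Returning None rather than raising an error makes it easy to spot rows whose URL does not follow the expected pattern.
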
  • CSV formatted list of journals with OCRd text (21 April 2019)
    This file provides metadata for 358 digitised journals in Trove that have OCRd text available for download. You can download the CSV file.

    This file includes the following columns:

    • fulltext_url – the url of the landing page of the digital version of the journal
    • title – the title of the journal
    • trove_id – the 'nla.obj' part of the fulltext_url, a unique identifier for the digital journal
    • trove_url – url of the journal's metadata record in Trove
    • issues – the number of available issues
    • issues_with_text – the number of issues from which OCRd text could be downloaded
    • directory – the directory in which the files from this journal have been saved (relative to the output directory)
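
As a rough illustration of how the columns listed above might be used, here is a minimal sketch that totals the issues and issues_with_text columns and lists journals where some issues have no OCRd text. The sample rows, URLs, and the function name summarise_ocr_coverage are invented for the example; only the column names come from the list above.

```python
import csv
import io


def summarise_ocr_coverage(csv_file):
    """Total the issue counts and list journals with issues lacking OCRd text."""
    totals = {"journals": 0, "issues": 0, "issues_with_text": 0}
    incomplete = []
    for row in csv.DictReader(csv_file):
        issues = int(row["issues"])
        with_text = int(row["issues_with_text"])
        totals["journals"] += 1
        totals["issues"] += issues
        totals["issues_with_text"] += with_text
        if with_text < issues:
            incomplete.append(row["title"])
    return totals, incomplete


# Made-up sample data using the documented column names (not real Trove records).
SAMPLE = """fulltext_url,title,trove_id,trove_url,issues,issues_with_text,directory
https://example.org/nla.obj-1,Example Journal A,nla.obj-1,https://example.org/work/1,10,10,example-journal-a
https://example.org/nla.obj-2,Example Journal B,nla.obj-2,https://example.org/work/2,5,3,example-journal-b
"""

totals, incomplete = summarise_ocr_coverage(io.StringIO(SAMPLE))
print(totals)
print(incomplete)
```

To run this against the downloaded file, you could pass an open file handle instead of the StringIO object, for example `open("journals.csv", newline="", encoding="utf-8")`, where the file name is whatever you saved the CSV as.
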
  • OCRd text from Trove digitised journals (21 April 2019)
    Using the notebook above I harvested metadata and OCRd text from Trove's digitised journals.

    • 358 journals had OCRd text available for download
    • OCRd text was downloaded from 27,426 journal issues
    • About 6.6 GB of text was downloaded

    The complete collection of text files for all the journals can be browsed and downloaded using this repository on CloudStor.

  • Editorial cartoons from The Bulletin, 1886 to 1952 (9 May 2019)
    Using the notebook above I downloaded at least one full page editorial cartoon for every issue of The Bulletin from 4 September 1886 to 17 September 1952. In total there are 3,471 images (approximately 60 GB). The complete collection can be downloaded from CloudStor. The names of each image file provide useful contextual metadata. For example, the file name 19330412-2774-nla.obj-606969767-7.jpg tells you:

    • 19330412 – the cartoon was published on 12 April 1933
    • 2774 – it was published in issue number 2774
    • nla.obj-606969767 – the Trove identifier for the issue, which can be used to make a URL, e.g. https://nla.gov.au/nla.obj-606969767
    • 7 – on page 7
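
The file naming convention above can be unpacked in code. This is a minimal sketch that assumes every file name follows the date-issue-identifier-page pattern of the example. The names CartoonFile and parse_cartoon_filename are hypothetical, not part of the harvesting notebook.

```python
from datetime import date, datetime
from typing import NamedTuple


class CartoonFile(NamedTuple):
    published: date    # publication date of the issue
    issue_number: int  # issue number of The Bulletin
    trove_id: str      # Trove identifier for the issue
    page: int          # page on which the cartoon appears
    url: str           # link to the issue, built from the identifier


def parse_cartoon_filename(filename):
    """Split a name like 19330412-2774-nla.obj-606969767-7.jpg into its parts."""
    stem = filename.rsplit(".", 1)[0]  # drop the file extension
    date_str, issue, rest = stem.split("-", 2)
    # The identifier itself contains a hyphen, so take the page from the right.
    trove_id, page = rest.rsplit("-", 1)
    return CartoonFile(
        published=datetime.strptime(date_str, "%Y%m%d").date(),
        issue_number=int(issue),
        trove_id=trove_id,
        page=int(page),
        url=f"https://nla.gov.au/{trove_id}",
    )


cartoon = parse_cartoon_filename("19330412-2774-nla.obj-606969767-7.jpg")
print(cartoon)
```

Something like this could be used to sort the images by date or group them by decade once they have been downloaded.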

    To make it easier to browse the images, I've compiled them into a series of PDFs – one PDF for each decade. The PDFs include lower resolution versions of the images together with their publication details and a link to Trove. They're all available from Dropbox.

  • Politicians talking about 'immigrants' and 'refugees'
    Using the notebook above I harvested parliamentary press releases that included any of the terms 'immigrant', 'asylum seeker', 'boat people', 'illegal arrivals', or 'boat arrivals'. A total of 12,619 text files were harvested. You can browse the files on CloudStor, or download the complete dataset as a zip file (43 MB).