Trove journals

Trove's 'journals' zone includes journals and journal articles, as well as other research outputs and things like press releases. You can access metadata from the journals zone through the Trove API, but to get text and images you need to do some screen scraping.

Tips, tools, and examples

Create a list of Trove's digitised journals

Everyone knows about Trove's newspapers, but there is also a growing collection of digitised journals available in the journals zone. They're not easy to find, however, which is why I created the Trove Titles web app. This notebook uses the Trove API to harvest metadata relating to digitised journals – or more accurately, journals that are freely available online in a digital form. This includes some born digital publications that are available to view in formats like PDF and MOBI, but excludes some digital journals that have access restrictions.

Get OCRd text from a digitised journal in Trove

Many of the digitised journals available in Trove make OCRd text available for download – one text file for each journal issue. However, while there are records for journals and articles in Trove (and available through the API), there are no records for issues. So how do we find them? This notebook shows how to extract issue data from a digitised journal and download OCRd text for each issue.

Get covers (or any other pages) from a digitised journal in Trove

In another notebook, I showed how to get issue metadata and OCRd texts from a digitised journal in Trove. It's also possible to download page images and PDFs. This notebook shows how to download all the cover images from a specified journal. With some minor modifications you could download any page, or range of pages.

Download the OCRd text for ALL the digitised journals in Trove!

Using the code and data from the previous two notebooks, you can download the OCRd text from every digitised journal. If you're going to try this, you'll need lots of patience and lots of disk space. Needless to say, don't try this on a cloud service like Binder. Fortunately you don't have to do it yourself, as I've already run the harvest and made all the text files available. See below for details. I repeat, you probably don't want to do this yourself. The point of this notebook is really to document the methodology used to create the repository.

Harvest parliament press releases from Trove

Trove includes more than 370,000 press releases, speeches, and interview transcripts issued by Australian federal politicians and saved by the Parliamentary Library. You can view them all in Trove by searching for nuc:"APAR:PR" in the journals zone. This notebook shows you how to harvest both metadata and full text from a search of the parliamentary press releases. The metadata is available from Trove, but to get the full text we have to go back to the Parliamentary Library's database, ParlInfo.
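As a rough illustration of the search described above, here is a minimal sketch that builds an API request url for the nuc:"APAR:PR" query. The endpoint and the parameter names (q, zone, encoding, key) are assumptions based on version 2 of the Trove API, where the journals zone is requested as 'article'; check them against the current API documentation before relying on them. YOUR_API_KEY is a placeholder, and build_search_url is a hypothetical helper, not code from the notebook.

```python
from urllib.parse import urlencode

# Assumed endpoint for version 2 of the Trove API; verify before use.
API_URL = "https://api.trove.nla.gov.au/v2/result"


def build_search_url(query, api_key, zone="article"):
    """Build a search url for the given query (hypothetical helper).

    The journals zone is assumed to be called 'article' in the API.
    """
    params = {
        "q": query,
        "zone": zone,
        "encoding": "json",
        "key": api_key,
    }
    # urlencode percent-encodes the colon and quotes in the query
    return f"{API_URL}?{urlencode(params)}"


search_url = build_search_url('nuc:"APAR:PR"', "YOUR_API_KEY")
print(search_url)
```

This only constructs the url; it doesn't send a request, so you can inspect it before harvesting anything.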

Harvesting data from the Bulletin

This is a more specific example of harvesting metadata and OCRd text from a digitised journal, in this case The Bulletin. It also shows how you can get the front cover images (or any other page).

Finding editorial cartoons in the Bulletin

In another notebook I showed how you could download all the front pages of The Bulletin (and other journals) as images. Amongst the front pages you'll find a number of full page editorial cartoons under The Bulletin's masthead. But you'll also find that many of the front pages are advertising wraparounds. The full page editorial cartoons were a consistent feature of The Bulletin for many decades, but they moved around between pages one and eleven. That makes them hard to find. I wanted to try and assemble a collection of all the editorial cartoons, but how?

Harvesting data from Home

This is a more specific example of harvesting metadata and OCRd text from a digitised journal, in this case The Home. It also shows how you can get the front cover images (or any other page).

Topic Modelling of Australian Parliamentary Press Releases by Adel Rahmani

This notebook explores the Politicians talking about 'immigrants' and 'refugees' collection of press releases (see below). Adel notes: 'I was curious about the contents of the press releases, however, at more than 12,000 documents the collection is too overwhelming to read through, so I thought I'd get the computer to do it for me, and use topic modelling to poke around the corpus.'

Data and text

CSV formatted list of journals available from Trove in digital form

Harvested: 25 August 2019

This file provides metadata for 2,620 journals that are available from Trove in a digital form. You can download the CSV file.

This file includes the following columns:

  • fulltext_url – the url of the landing page of the digital version of the journal
  • title – the title of the journal
  • trove_id – the 'nla.obj' part of the fulltext_url, a unique identifier for the digital journal
  • trove_url – url of the journal's metadata record in Trove
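To show how these columns fit together, here is a minimal sketch that reads rows with this layout and checks that each trove_id matches the 'nla.obj' identifier embedded in its fulltext_url. The sample rows, titles, and identifiers are invented for illustration, and the function names are hypothetical; to work with the real data you would open the downloaded CSV file instead of the inline sample.

```python
import csv
import io
import re

# Hypothetical sample rows that mimic the column layout described above.
# The titles, identifiers, and urls are invented for illustration.
SAMPLE_CSV = """fulltext_url,title,trove_id,trove_url
https://nla.gov.au/nla.obj-000000001,Example Journal A,nla.obj-000000001,https://trove.nla.gov.au/work/1
https://nla.gov.au/nla.obj-000000002,Example Journal B,nla.obj-000000002,https://trove.nla.gov.au/work/2
"""


def extract_trove_id(fulltext_url):
    """Return the 'nla.obj-...' identifier in a url, or None if absent."""
    match = re.search(r"nla\.obj-\d+", fulltext_url)
    return match.group(0) if match else None


def load_journals(csv_file):
    """Read journal rows from an open CSV file into a list of dicts."""
    return list(csv.DictReader(csv_file))


journals = load_journals(io.StringIO(SAMPLE_CSV))
for journal in journals:
    # The stored trove_id should match the identifier embedded in the url
    assert extract_trove_id(journal["fulltext_url"]) == journal["trove_id"]
    print(journal["title"], journal["trove_id"])
```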

CSV formatted list of journals with OCRd text

Harvested: 25 August 2019

This file provides metadata for 720 digitised journals in Trove that have OCRd text for download. You can download the CSV file. You can also browse a human-readable list.

This file includes the following columns:

  • fulltext_url – the url of the landing page of the digital version of the journal
  • title – the title of the journal
  • trove_id – the 'nla.obj' part of the fulltext_url, a unique identifier for the digital journal
  • trove_url – url of the journal's metadata record in Trove
  • issues – the number of available issues
  • issues_with_text – the number of issues from which OCRd text could be downloaded
  • directory – the directory in which the files from this journal have been saved (relative to the output directory)
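One thing the issues and issues_with_text columns make possible is a quick check of how complete the OCRd text is for each journal. The sketch below assumes both columns hold whole numbers; the sample rows (including the directory names) are invented, only a subset of the columns is shown, and ocr_coverage is a hypothetical helper.

```python
import csv
import io

# Hypothetical sample rows using a subset of the columns listed above.
# All values, including the directory names, are invented for illustration.
SAMPLE_CSV = """title,trove_id,issues,issues_with_text,directory
Example Journal A,nla.obj-000000001,10,10,example-journal-a
Example Journal B,nla.obj-000000002,40,30,example-journal-b
"""


def ocr_coverage(rows):
    """Return (title, share of issues with OCRd text) for each row."""
    results = []
    for row in rows:
        issues = int(row["issues"])
        with_text = int(row["issues_with_text"])
        # Guard against journals that list no issues at all
        share = with_text / issues if issues else 0.0
        results.append((row["title"], share))
    return results


coverage = ocr_coverage(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for title, share in coverage:
    print(f"{title}: {share:.0%} of issues have OCRd text")
```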

OCRd text from Trove digitised journals

Harvested: 25 August 2019

Using the notebook above I harvested metadata and OCRd text from Trove's digitised journals.

  • 719 journals had OCRd text available for download
  • OCRd text was downloaded from 33,035 journal issues
  • About 8 GB of text was downloaded

The complete collection of text files for all the journals can be browsed here and downloaded using this repository on CloudStor.

Editorial cartoons from The Bulletin, 1886 to 1952

Harvested: 9 May 2019

Using the notebook above I downloaded at least one full page editorial cartoon for every issue of The Bulletin from 4 September 1886 to 17 September 1952. In total there are 3,471 images (approximately 60 GB). The complete collection can be downloaded from CloudStor. The name of each image file provides useful contextual metadata. For example, the file name 19330412-2774-nla.obj-606969767-7.jpg tells you:

  • 19330412 – the cartoon was published on 12 April 1933
  • 2774 – it was published in issue number 2774
  • nla.obj-606969767 – the Trove identifier for the issue, which can be used to make a url, e.g. https://nla.gov.au/nla.obj-606969767
  • 7 – on page 7
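The file name pattern described above can be unpacked with a regular expression. This is a minimal sketch that assumes every file follows the date-issue-identifier-page.jpg layout of the example; parse_cartoon_filename and the keys of the dictionary it returns are hypothetical names chosen for illustration.

```python
import re
from datetime import datetime

# Pattern for names like 19330412-2774-nla.obj-606969767-7.jpg
FILENAME_PATTERN = re.compile(
    r"^(?P<date>\d{8})-(?P<issue>\d+)-(?P<trove_id>nla\.obj-\d+)-(?P<page>\d+)\.jpg$"
)


def parse_cartoon_filename(filename):
    """Split a cartoon file name into its metadata parts (hypothetical helper)."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unexpected file name: {filename}")
    trove_id = match.group("trove_id")
    return {
        "date": datetime.strptime(match.group("date"), "%Y%m%d").date(),
        "issue_number": int(match.group("issue")),
        "trove_id": trove_id,
        # The identifier can be appended to nla.gov.au to link to the issue
        "url": f"https://nla.gov.au/{trove_id}",
        "page": int(match.group("page")),
    }


details = parse_cartoon_filename("19330412-2774-nla.obj-606969767-7.jpg")
print(details)
```

Parsing the names this way would let you sort or filter the images by date, issue, or page without opening any of the files.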

To make it easier to browse the images, I've compiled them into a series of PDFs – one PDF for each decade. The PDFs include lower resolution versions of the images together with their publication details and a link to Trove. They're all available from Dropbox.

Politicians talking about 'immigrants' and 'refugees'

Using the notebook above I harvested parliamentary press releases that included any of the terms 'immigrant', 'asylum seeker', 'boat people', 'illegal arrivals', or 'boat arrivals'. A total of 12,619 text files were harvested. You can browse the files on CloudStor, or download the complete dataset as a zip file (43 MB).