The British Film Institute (BFI) is one of the world’s oldest and largest moving image archives. Stephen McConnachie, BFI’s head of data and digital preservation, talks to FEED about building an archive, open source formats and digitising 100,000 titles in five years
FEED: What brought you to the film preservation wing of the BFI?
STEPHEN McCONNACHIE: My bachelor’s degree was in English Literature – a long time ago. I got interested in film and started to write about how film and literature align – or don’t. So my master’s degree was in film and television studies at Warwick University.
My first main job after university was in Berkhamsted at the BFI National Archive. The job was looking at unexamined collections that hadn’t been documented and weren’t well understood. My job was literally opening film cans and examining the contents, playing them and coming to some decisions about whether they would be put into the national collection or not. If they were, we had to document and assess them.
After a few years, I got interested in the systems. A new system was developed for that work, and that was the trigger for phase two of my career, when I became more interested in the systems than the films.
FEED: What kinds of new systems were you developing?
SM: The systems we created were about enabling the people who opened the can to document the film in the database. The person handling the film became the cataloguer of the film. It was a new model. It’s quite common now, but at the time it was an exciting new thing. Before that, there had been a workflow where someone would be the film examiner, they’d pass the details to another person who would be the cataloguer, and so on. The in-house BFI system was developed to let us do all of that directly. It set off a light bulb in my head, and I realised I was interested in systems and data, and system architecture.
FEED: So what did you do after that?
SM: I left the BFI and went to television, to the ITN Archive for TV news. TV was ahead of film at that point in systems development. I got to be in the first wave of digital asset management and digital preservation systems that were being rolled out in the TV world. I was heavily involved in systems development for data creation, cataloguing and documentation – also with web access, so the data could feed web applications – and I was involved in digital preservation with data tape libraries that were being deployed to store news footage.
I was there for nearly ten years, then an opportunity came up at the BFI archive in 2009 and I came back.
FEED: When you returned to the BFI, what new implementations were there? How did the archive develop?
SM: At the time I arrived there was a big publicly funded project to build a new collections management system, a database to document all of the collections, digital and analogue. The BFI National Archive has one of the biggest moving image collections in the world. It was established in 1935, so it’s also one of the oldest. But its management systems were disparate – there were many of them and they didn’t join up or integrate, and they didn’t have open APIs.
We established new systems with REST APIs that can deliver data in and out over networks, including the internet. That’s a revolution, because it makes integration between systems possible via API calls. We bought an Adlib collections management system from Axiell, tailored it to our needs, and deployed a new data model that had just been ratified in the moving image world, which was based on a hierarchy.
Because we had built that big database system, we could integrate it into our digital preservation system when it came time to procure that. We were very lucky to get another public budget from our Unlocking Film Heritage programme to spend on a digital preservation infrastructure.
The ability to integrate these systems was a consequence of buying solutions that had open REST APIs. Instead of a collection being a closed box, it becomes a box that springs open on command. It allows you to deliver your media from your servers to your web applications via API integrations. And that’s what we do now. We can build collections access portals. We have one at BFI Southbank called the Mediatheque, which is built on top of these open REST APIs. It lets anyone go to BFI Southbank and watch any of the television and film titles that we have digitised.
FEED: How do you prioritise the digitisation of material in the archives?
SM: We tend to work in five-year strategic blocks. Our previous five years was called Film Forever, and within that there was a project called Unlocking Film Heritage. For that, we digitised 10,000 films – 5,000 from the BFI National Archive and 5,000 films from our National and Regional Film Archive partners around the UK.
We made those 10,000 films available on our VOD platform, BFI Player, for UK viewers. We also made some of these titles available for UK and international viewers on the BFI Facebook and YouTube channels, and we had over 50 million views of those films through the project period. It was a major public access outcome for us.
Some of these films had not been seen for many decades. They included films from the Edwardian era – we have a collection called the Mitchell and Kenyon Collection, which is actuality footage of Britain from the 1900s and 1910s.
We’ve done a lot of digitising ourselves and still do, but we also developed a framework of digitisation partners in the UK who did a lot of the digitisation for that project. It really helped develop skills and technologies, and helped keep the digitisation of film economically viable and important in the UK.
The logical choice: Two Spectra Logic tape libraries have been installed in the BFI’s Hertfordshire conservation centre to handle the organisation’s massive data storage needs
FEED: What guidelines and best practices do you employ when you’re selecting titles and preserving them digitally?
SM: We work with all our partners, with the National and Regional Film Archives and digitising suppliers, to develop a set of technical standards and deliverables. We base those on tried and tested film archive standards, including using the DPX format.
DPX is pretty much the standard for digitising film material for long-term preservation. It’s a very high-resolution photograph of every frame of the film. In real time, the scanner takes a high-quality image of each frame of the film and that frame becomes a single DPX file. You build up what is effectively a folder of these images, and that is your preservation copy. It could result in 100,000, 150,000 or 200,000 DPX images for each film.
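As a back-of-the-envelope illustration of those numbers, the arithmetic works out roughly like this. The resolution, bit depth and frame count here are assumptions for a typical 2K 10-bit scan, not the BFI’s actual scanner settings:

```python
# Rough size estimate for a DPX film scan.
# Assumed figures: 2K resolution, 10-bit RGB, ~150,000 frames
# (around a 100-minute feature at 24 fps) -- illustrative only.
WIDTH, HEIGHT = 2048, 1556   # common 2K film-scan resolution
BYTES_PER_PIXEL = 4          # 10-bit RGB packed into 32 bits per pixel in DPX
FRAMES = 150_000

frame_mb = WIDTH * HEIGHT * BYTES_PER_PIXEL / 1e6
total_tb = frame_mb * FRAMES / 1e6

print(f"~{frame_mb:.1f} MB per frame, ~{total_tb:.2f} TB per film")
```

At roughly 12.7 MB per frame, a single feature-length scan approaches two terabytes – which is why one preservation copy can run to terabytes.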
Once you have that image, you can restore it, regrade it and remove scratches, but you will always have the original document that you scanned, without any tampering or intervention. It’s a really important principle for film archives that you document the film at the best quality and highest resolution you can achieve at the time.
We also have standards for web delivery. We have that preservation file, which is huge – they can be two terabytes each – then we have a mezzanine file that we create from that to use in day-to-day work. The sector tends to use ProRes QuickTime files for that.
We also create low bit rate viewing access proxy files for display online. Those tend to be H.264 MP4s.
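For readers who want to experiment, the mezzanine and proxy steps can be sketched with FFmpeg. The codec flags (`prores_ks`, `libx264`) are standard FFmpeg options, but the filenames and bit rate are hypothetical – the BFI’s actual transcode settings aren’t given in the interview. A small Python helper that builds the commands:

```python
# Build FFmpeg command lines for a ProRes mezzanine and an H.264 MP4 proxy.
# Filenames and bit rates are hypothetical; run them with subprocess.run(cmd).

def prores_mezzanine(src: str, dst: str) -> list[str]:
    # ProRes 422 HQ ("-profile:v 3") in a QuickTime container
    return ["ffmpeg", "-i", src,
            "-c:v", "prores_ks", "-profile:v", "3",
            "-c:a", "pcm_s16le", dst]

def h264_proxy(src: str, dst: str) -> list[str]:
    # Low-bit-rate H.264 MP4 for web viewing
    return ["ffmpeg", "-i", src,
            "-c:v", "libx264", "-b:v", "1500k",
            "-c:a", "aac", dst]

print(" ".join(prores_mezzanine("scan.mov", "mezzanine.mov")))
print(" ".join(h264_proxy("mezzanine.mov", "proxy.mp4")))
```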
FEED: What are the plans for the BFI National Archive down the line?
SM: The next big challenge is videotape digitisation – that’s the five-year block the BFI has just entered. We’re committing to digitising 100,000 works on videotape! We’re currently auditing the tape collections in our own archive and with our National and Regional Archive partners around the UK. What we hope to preserve is a complete corpus of British broadcasting history.
There will be a lot of TV, but there is also non-broadcast stuff. For decades, the world produced a lot of videotape content – industrial film, promotional film, music videos – so it’s a mixture of broadcast and non-broadcast. It’s also artist moving image and workshop collective filmmaking.
FEED: What are the big challenges for the television archiving project?
SM: There are huge challenges. First, because no one manufactures videotape playback equipment any more. My colleagues in the conservation teams have had to create an archive of videotape machinery to play these tapes in order to digitise them.
The file standards for videotape digitisation are different, too. We don’t use DPX sequences, because you don’t capture a frame of video in quite the same way. But there are exciting developments in this field. There’s a new moving image preservation file format called FFV1 (FF Video Codec 1), which comes from the FFmpeg project, an effort to create a set of open, non-proprietary tools for video processing and preservation. Every archive and moving image organisation uses FFmpeg – it’s a non-proprietary codec, and it’s lossless and compressed, so much more affordable for your huge collection.
The container is Matroska – with the file extension .mkv. This is a new, revolutionary thing in our world: using FFV1 and Matroska as an open, non-proprietary solution built by a community of archivists. We’ll aim to preserve all of the video digitisation in that new FFV1-Matroska combination.
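The FFV1/Matroska step can be sketched the same way. These are the settings commonly used in the archive community (FFV1 version 3 with per-slice CRCs); the filenames are hypothetical and this isn’t necessarily the BFI’s exact recipe:

```python
def ffv1_preservation(src: str, dst: str) -> list[str]:
    # Lossless FFV1 version 3 in a Matroska container, with per-slice CRCs
    # so any bit-level damage can be localised; audio is passed through.
    return ["ffmpeg", "-i", src,
            "-c:v", "ffv1", "-level", "3", "-slicecrc", "1",
            "-c:a", "copy", dst]

print(" ".join(ffv1_preservation("tape_capture.mov", "preservation.mkv")))
```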
National service: The BFI National Archive undertakes the huge task of preserving the moving image collection on behalf of the British nation, including film, broadcast and music videos
FEED: With that amount of material, how do you identify a title, catalogue it and assign metadata? Are you looking at using AI or other technology around that?
SM: For the film project, the 10,000 films we digitised were published on BFI Player. We had a clear understanding of what metadata we had to create to make it publicly searchable and comprehensible. That was baked into the project plan.
As part of the budget, we made sure we resourced the cataloguing, and we went really deep on that. We went so far as to do geographic location cataloguing of latitude and longitude for each film, building a map we called Britain on Film. On BFI Player, you can go to the Britain on Film map of the UK and zoom in to your town and watch films tagged with that latitude and longitude. We wanted to give back to UK viewers and say, “here are films from where you grew up, or where you were born, or where you live now, or where your grandparents lived”.
However, the video digitisation project, because we’re scaling up massively to try to digitise 100,000 titles, is a much bigger challenge. Metadata is definitely a more difficult and complex challenge. So we are looking at the potential of AI. It won’t solve the problem, but it’s potentially a useful methodology to give us more metadata than we could ever achieve by manual, human cataloguing.
Speech to text is a good example of what we hope to explore. A lot of the videotape will be from broadcast, so it will have not only spoken dialogue, but narration and commentary. And there’s object recognition and face detection. We hope to explore those machine learning potentials to see what metadata we can generate without having to undertake manual, human cataloguing. It’s a massive challenge for this project, because the scale of it is so large.
FEED: What storage infrastructure do you have to handle all this video data?
SM: Part of the five-year project to digitise film included building a preservation solution to store the zeros and ones. That was really a mammoth undertaking. We didn’t do a standard procurement, where you write down your requirements in huge detail and someone sells you the equipment. Instead, we wrote down our objectives and strategic aims, and procured a system integrator, Ovation Data. Their job was to recommend all the system components to build our infrastructure. That meant networking – we chose a 10Gbps network, which you really need when you’re moving these big files around all day, every day – and a data tape storage solution. We went with Spectra Logic.
We deployed two Spectra Logic tape libraries at opposite sides of our conservation centre in Hertfordshire. We stocked one with LTO-6 tape – popular with archives because of its open architecture, and its road map is documented. The other library we stocked with IBM tapes, which are much denser. With LTO-6, we get about 2.5 terabytes per tape. With IBM, about 8.5 terabytes per tape.
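Those per-tape densities translate directly into tape counts. As a rough illustration – a round petabyte of content, ignoring compression and formatting overhead:

```python
import math

# Tapes needed per petabyte, at the per-tape capacities quoted above
CONTENT_TB = 1000      # 1 PB of content
LTO6_TB = 2.5          # capacity of one LTO-6 tape
IBM_TB = 8.5           # capacity of one IBM enterprise tape

lto_tapes = math.ceil(CONTENT_TB / LTO6_TB)
ibm_tapes = math.ceil(CONTENT_TB / IBM_TB)

print(f"1 PB needs {lto_tapes} LTO-6 tapes or {ibm_tapes} IBM tapes")
```

So a petabyte fills 400 LTO-6 tapes, but only 118 of the denser IBM tapes.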
The other thing we procured from Spectra Logic was BlackPearl, Spectra Logic’s gateway to the data tape libraries. The reason BlackPearl is revolutionary is that it’s a big disk cache – a lot of storage on disk – with a REST API interface, which means you can move data in and out of BlackPearl with REST calls over the network.
BlackPearl also has the intelligence to store your data in both tape libraries, sometimes across multiple tapes because the files are so big: it becomes an intelligence layer in front of your data tape libraries and manages the complexity. Spectra Logic publishes SDKs for BlackPearl in Python and other languages, so we can write Python applications that integrate with it – liberating us to be much more in control of our own digital preservation destiny.
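To make that “intelligence layer” idea concrete, here is a toy model of cloning a file to two libraries and spanning it across tapes when it won’t fit on one. This is purely illustrative – it is not Spectra Logic’s actual placement algorithm, and the function names are made up:

```python
def span_across_tapes(file_tb: float, tape_capacity_tb: float) -> list[float]:
    """Split a file into per-tape chunks, none larger than one tape."""
    chunks, remaining = [], file_tb
    while remaining > 0:
        chunk = min(remaining, tape_capacity_tb)
        chunks.append(chunk)
        remaining -= chunk
    return chunks

def clone_to_libraries(file_tb: float) -> dict[str, list[float]]:
    # One copy per library, at the per-tape capacities quoted above
    return {"LTO-6": span_across_tapes(file_tb, 2.5),
            "IBM": span_across_tapes(file_tb, 8.5)}

# A 6 TB object spans three LTO-6 tapes but fits on a single IBM tape
print(clone_to_libraries(6.0))
```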
FEED: And that’s all being used now?
SM: Yes, we built and deployed it about three years ago, and since then it’s been filling up with data from collections held by the BFI National Archive. At last count we had just over three petabytes of content stored. And that means three petabytes times two, because we keep a clone – so six petabytes in total. And it grows every day. We have an automated ingest pipeline: we record television off air – 17 channels of UK television – automatically, and that flows into the preservation system.
FEED: Why didn’t you go with a cloud solution?
SM: At the time it was a big no – which is not to say it would still be a big no if we refreshed our thinking in ten years’ time.
First, at the time we procured it was difficult to enter into an agreement with a cloud provider that would give you certainty about where your files were being stored, which legal territory. The BFI National Archive stores the moving image collection on behalf of the British nation. We don’t own the rights – it’s not a commercial exercise. We had to guarantee that the national collection is stored in the UK.
Another reason was the network implications of getting your files to the cloud. Don’t forget these files can be two terabytes each, and we’re digitising 10,000 of these things, as well as all the other digitisation and acquisition that we do. Shipping the files from our environment to a cloud provider over the internet is a substantial bandwidth challenge. We do have a big internet breakout, but nonetheless you spend a lot of money and time occupying your network getting your files to the cloud – and then again if you want the files back from the cloud to use in your restorations, and so on.
And it’s still very costly to spin up a huge 2Gbps or 10Gbps internet connection, especially when you consider the public funding nature of the project. The recommendation of our system integrators was that if we provision data storage on site, we would get much more for our money in terms of cost-effective digital preservation.
FEED: Does the BFI have any plans for archiving video material from the internet?
SM: It is in our strategy to try and address it. A lot of people are watching a lot of moving image content on the web. We need to find a solution for that, but I’m not going to pretend we have yet.
There is still a lot of web archiving that happens. The British Library has the responsibility for archiving the .uk web domain, and a lot of web archiving is also done by The National Archives in the UK.
But archiving the moving image content on the web is not necessarily a solved problem.
FEED: Just choosing what to keep and what not to would be a huge question.
SM: The hours of video uploaded to YouTube every minute outstrips massively what any organisation, apart from Google, of course, could hope to preserve. How do you even begin to find an angle on that as a preservation challenge for people to look back on in 100 years?
Here’s an analogy: we can go to BFI Player and watch film from over 100 years ago. In fact, we’re currently digitising the entire corpus of the Victorian era. We have over 700 films in the collection from Victoria’s reign, with the first of them from 1895. Who’s to say in 100 years whether we’ll be able to watch web video from today? Can we preserve web video to make sure that in 100 years we can view it?
FEED: Will it be a public institution that does that? Or will it be whatever Google is then who decides what’s available?
SM: That’s a key distinction. The aim of a public archive is to be permanent and to persist through time. The BFI National Archive has been maintained since the 1930s. I believe the oldest moving image archive in the world is at the Imperial War Museum in the UK, which was formed during World War I. The aim of those archives is to persist through time, to guarantee – if you can guarantee – that the public will be able to see that material in 20, 50, 100 years, but it’s not in the mission of Google. As far as I know, they have no stated mission to preserve access to that material through time.
This article originally appeared in the October 2018 issue of FEED magazine.