Parallel Audiobook Corpus
Data Creator: Ribeiro, Manuel Sam
Publisher: University of Edinburgh. School of Informatics
Citation: Ribeiro, Manuel Sam. (2018). Parallel Audiobook Corpus, [dataset]. University of Edinburgh. School of Informatics. https://doi.org/10.7488/ds/2468
Description: The Parallel Audiobook Corpus (version 1.0) is a collection of parallel readings of audiobooks. The corpus consists of approximately 121 hours of speech at 22.05 kHz across 4 books and 59 speakers. The data is provided in two formats. Chapter data contains the audiobook recordings at the chapter level; each chapter-level waveform is accompanied by its text and the corresponding word-level alignment. This format is useful if you need a segmentation that does not correspond to utterance-level units. Segmented data provides a more traditional corpus format: the chapter-level alignments were segmented into utterances, with waveforms organized by speaker. Within each book, utterance identifiers are consistent across speakers, which makes it simple to find parallel data.
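Because utterance identifiers are shared across speakers within a book, parallel utterances can be collected by grouping files on their identifier. The following is a minimal sketch only: it assumes a hypothetical layout of one directory per speaker inside a book directory, with each waveform named after its utterance identifier (<book>/<speaker>/<utterance_id>.wav). The actual directory structure and file naming of the corpus may differ, and the function name find_parallel_utterances is illustrative, not part of the corpus.

```python
from collections import defaultdict
from pathlib import Path


def find_parallel_utterances(book_dir, min_speakers=2):
    """Map each utterance identifier to the speakers who recorded it.

    ASSUMPTION: files are laid out as book_dir/<speaker>/<utterance_id>.wav.
    This layout is hypothetical; adjust the glob pattern to the real corpus.
    """
    by_utterance = defaultdict(dict)
    for wav_path in sorted(Path(book_dir).glob("*/*.wav")):
        speaker = wav_path.parent.name   # directory name taken as speaker ID
        utterance_id = wav_path.stem     # file name without the .wav suffix
        by_utterance[utterance_id][speaker] = wav_path

    # Keep only utterances read by at least `min_speakers` speakers.
    return {
        utterance_id: speakers
        for utterance_id, speakers in by_utterance.items()
        if len(speakers) >= min_speakers
    }
```

Calling find_parallel_utterances on a book directory returns a dictionary keyed by utterance identifier, where each value maps speaker names to waveform paths. Raising min_speakers restricts the result to utterances covered by more readers.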