First, the corpus contains two layers of annotation, at the phonetic and orthographic levels.
The same holds true of text corpora, in the sense that the original text usually has an external source, and is considered to be an immutable artifact.
Any transformation of that artifact that involves human judgment, even something as simple as tokenization, is subject to later revision; it is therefore important to retain the source material in a form that is as close to the original as possible.
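One way to respect this principle is to store a tokenization as character offsets into the untouched source text, rather than as a modified copy of it. The following is only a minimal sketch under stated assumptions: the sample sentence, the function name `tokenize_spans`, and both regular expressions are illustrative choices and do not come from any particular corpus or toolkit.

```python
import re

# The source text is treated as an immutable artifact: it is only ever read.
SOURCE = "She had your dark suit."

def tokenize_spans(text, pattern):
    """Return (start, end) character offsets of each token matched in text."""
    return [m.span() for m in re.finditer(pattern, text)]

# First tokenization decision (hypothetical): split punctuation from words.
spans_v1 = tokenize_spans(SOURCE, r"\w+|[^\w\s]")
tokens_v1 = [SOURCE[start:end] for start, end in spans_v1]

# Later revision (hypothetical): split on whitespace only.
# Only the offsets change; SOURCE itself is left exactly as it was.
spans_v2 = tokenize_spans(SOURCE, r"\S+")
tokens_v2 = [SOURCE[start:end] for start, end in spans_v2]

print(tokens_v1)
print(tokens_v2)
```

Because each token is recovered by slicing the original string, a revised tokenization policy replaces the list of offsets and nothing else, and the two analyses can be compared directly against the same source.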
TIMIT was developed by a consortium including Texas Instruments and MIT, from which it derives its name.
It was designed to provide data for the acquisition of acoustic-phonetic knowledge and to support the development and evaluation of automatic speech recognition systems.
Finally, TIMIT includes demographic data about the speakers, permitting fine-grained study of vocal, social, and gender characteristics.