This paper was published in the Proceedings of the ISSC '98 (Irish Signals and Systems Conference), DIT Kevin Street, Dublin, Ireland, 25-26 June 1998.
Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the ISSC. Contact: A. O'Dwyer, School of Control Systems and Electrical Engineering, Dublin Institute of Technology (DIT), Kevin Street, Dublin 8. Telephone: (00)3531-402 4992. Email: aodwyer@dit.ie
The objective of this paper is to investigate and compare the endemic
problems that are encountered when attempting to morph two or more
tones together to create a 'mongrel' sound in the process known as timbre
morphing. Three fundamental difficulties are analysed, namely the identification
of harmonics in a time-frequency distribution, the identification of onset
and decay times of these harmonics, and the problem of incorrect magnitudes
due to the obvious limitation caused by spectral leakage. Solutions to
these problems, based on the use of a time-frequency distribution other
than the traditional spectrogram, are proposed in this paper, thereby providing
the basis for improved timbre feature analysis and processing.
1 Introduction
When the term 'morphing' is mentioned, most people think of the visual morphing seen in many TV advertisements, in prominent films such as 'Terminator 2: Judgment Day', and in pop videos such as Michael Jackson's 'Black or White'. Few people realise that morphing can be done with sound as well. Two main 'camps' have been working on what is referred to as 'Timbre' or 'Audio' Morphing. Malcolm Slaney [1] led a team which investigated the possibilities inherent in audio morphing through the representation of sound in a multi-dimensional space. They developed a new approach based on separate spectrograms to encode the pitch and the broad spectral shape of the sound. These spectrograms are independently modified to create pleasing morphs between many sounds. The key to this approach is the correct identification of pitch. Slaney concludes that there is room for improvement through the development of better representations, better matching techniques and more natural sounding interpolation schemes - especially perceptually optimal interpolation functions.
The second team to develop a practical approach to Timbre Morphing is the CERL Sound Group at the University of Illinois [2]. They developed a package called Lemur, which outputs a file containing a sequential list of frames, each describing a small portion of the sound. The time-frequency analysis method that Lemur uses to generate these frames is effective, but not optimal in its representation of a sound, due to both temporal and spectral smearing.
In the Audio Laboratory in TCD, a software package called 'Mongrel' is being developed which takes two tones and cross-breeds them to generate a mongrel tone containing characteristics of both parent tones. The motivation behind this development is to aid contemporary electro-acoustic composers in their exploration of new timbres in their compositional process.
Two of the main differences between this work and that already completed by other teams are the use of improved time-frequency analysis representations and a simplified software interface which will allow a composer to generate new sounds at leisure, hard disk space permitting!
To achieve the objective of morphing two tones into a third new tone,
the tones must be processed correctly to ensure accurate representation.
The accepted and logical method for causing maximum contribution on the
part of both parent tones is to identify certain 'features' in the contributing
spectrograms, and morph these characteristic features into one another.
In the program under development all signal manipulation and morphing is
to be implemented automatically, keeping user interaction to a minimum.
The user simply chooses the two tones to morph and how much of each tone
is required in the new sound. This paper sets out to show solutions
to some of the problems that are encountered in attempting to arrive at
an effective implementation.
2 Timbre Representation
Timbre is defined as that part of a sound which is neither loudness nor pitch. Despite this negative definition, timbre is one of the primal attributes by which we characterise sounds. It is the fundamental reason that a saxophone, for example, will be identifiable whether it is heard on the radio of the most modern hi-fi system, or over a battered old amoebic radio on a remote desert island with diabolical reception. Historically, research has identified that 'timbre space' contains both spectral and temporal dimensions. The importance of the temporal aspect can be appreciated from the fact that any harmonic tone - such as a piano tone - played backwards is perceived differently from the same tone played in its normal sequence, yet the magnitude spectrum of the complete tone is identical in both cases. Thus, to allow the temporal development of spectral components in a musical sound to be observed, it is necessary to combine time-amplitude and frequency-amplitude representations. When the signal is observed using a time-frequency representation, it is imperative that these components can be effectively quantified.
The CERL Sound Group defines a feature as being "any portion of the sound that is important in the morphing process" [3]. In order to facilitate a user-friendly package, a primary requirement is the use of automatic feature identification. During the morphing process, two kinds of feature, both known to contribute perceptually to what a listener hears, need to be identified: spectral features, which include the fundamental and harmonics, and transient features such as the onset, steady state and decay.
The Short-Time Fourier Transform (STFT), whose squared magnitude gives the Spectrogram [4], is an extension of the Fourier Transform (FT) in which the FT is repeatedly evaluated for a running windowed version of the time domain signal. Each FT gives a frequency domain 'slice' associated with the time value at the window centre. The STFT enables us to observe the temporal and spectral characteristics at any point in the 'timbre space'. However, the STFT has an intrinsic problem in that it introduces both spectral and temporal smearing: a longer window sharpens the frequency resolution but blurs events in time, and a shorter window does the opposite. As such, its use as an analysis tool for feature identification is somewhat questionable.
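The running-window evaluation described above can be illustrated with a minimal NumPy sketch. It is not taken from the Mongrel package; the function name `stft_magnitude`, the Hann window, and the window and hop lengths are all assumptions chosen for illustration.

```python
import numpy as np

def stft_magnitude(x, win_len=256, hop=64):
    """Return |STFT| as an array of shape (n_frames, win_len // 2 + 1).

    Each row is the FT magnitude of one windowed slice of x.
    Assumes len(x) >= win_len.
    """
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# Illustrative test signal: a 1 kHz sine sampled at 8 kHz.
fs = 8000.0
t = np.arange(4096) / fs
tone = np.sin(2 * np.pi * 1000.0 * t)
S = stft_magnitude(tone)
```

Increasing `win_len` narrows each spectral peak but makes every slice cover a longer stretch of time, which is the trade-off behind the smearing discussed in this section.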
Recently, the Wigner Distribution (WD) has been examined as an
alternative time-frequency analysis method for musical signals, instead of
the STFT. It contributes less spectral smearing than the spectrogram, and
no temporal smearing, and as such is a more accurate representation for
the purpose of musical synthesis. The WD does, however, introduce unwanted
cross-terms into the time-frequency representation. Fortunately, the use
of an appropriate smoothing window in the smoothed discrete Pseudo Wigner
Distribution (SPWD) alleviates this problem, while still not introducing
smearing to the extent evident in the spectrogram. In Figure 1, both the
spectral and temporal resolution differences of the STFT and SPWD can be
clearly seen.
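As a rough illustration of how a discrete pseudo Wigner distribution can be computed, the sketch below applies a lag window to the local autocorrelation of the analytic signal and takes an FFT over the lag variable. This is a generic textbook formulation, not the authors' implementation: the function names, the Hann lag window and the parameter values are assumptions, and the additional time smoothing that turns the pseudo WD into the SPWD is deliberately left out.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal obtained by zeroing the negative-frequency FFT bins."""
    n = len(x)
    spectrum = np.fft.fft(x)
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1 : n // 2] = 2.0
    else:
        gain[1 : (n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * gain)

def pseudo_wigner(x, half_lag=63, n_fft=128):
    """Pseudo Wigner distribution of x, shape (len(x), n_fft).

    Row t is the FFT over lag m of h[m] * z[t+m] * conj(z[t-m]).
    The lag step spans two samples, so bin k maps to k * fs / (2 * n_fft).
    Requires 2 * half_lag + 1 <= n_fft.
    """
    z = analytic_signal(x)
    n = len(z)
    lags = np.arange(-half_lag, half_lag + 1)
    h = np.hanning(2 * half_lag + 1)  # lag window (assumed Hann)
    out = np.zeros((n, n_fft))
    for t in range(n):
        # Keep only the lags for which both z[t+m] and z[t-m] exist.
        valid = (t + lags >= 0) & (t + lags < n) & (t - lags >= 0) & (t - lags < n)
        m = lags[valid]
        kernel = np.zeros(n_fft, dtype=complex)
        kernel[m % n_fft] = h[valid] * z[t + m] * np.conj(z[t - m])
        # The kernel is Hermitian in m, so its FFT is real.
        out[t] = np.fft.fft(kernel).real
    return out

# Illustrative test signal: a 1 kHz sine sampled at 8 kHz.
fs = 8000.0
x = np.sin(2 * np.pi * 1000.0 * np.arange(512) / fs)
W = pseudo_wigner(x)
peak_hz = np.argmax(W[256]) * fs / (2 * 128)
```

A smoothed version would additionally average neighbouring rows with a short time window, trading a little temporal resolution for attenuation of the cross-terms.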
3 Feature Extraction
To date, the basis for timbre representation and manipulation has been
what is referred to as the 'classical' concept of tone quality which relates
timbre primarily to its spectral composition, i.e. to the pattern
of harmonics inherent in different instrument tones. The more modern position
gives full acknowledgement to the importance of the details of the temporal
evolution of individual harmonics. The issues involved in 'error free'
feature detection that require care include the spectral identification
of partials, and the temporal identification of salient points, such as
the onset and decay of harmonics. To quote Malcolm Slaney, "Every time
you make a decision, i.e. 'this is a feature', you have the chance
to make an error." In other words, you can identify it incorrectly. There
is also the concern of a mismatch in the number of features in two tones
to be morphed. There may be a partial in one sound with no corresponding
partial in the second sound. In this case, the existing partial is morphed
with a zero-magnitude partial. When we are faced with a tone to morph,
what is required is to choose a select number of features, identify them
in the distributions, and ultimately interpolate between the start and
end tones to locate any chosen intermediate timbre.
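The pairing of an unmatched partial with a zero-magnitude partial, followed by interpolation between the two tones, can be sketched as follows. Linear interpolation and pairing by harmonic number are assumptions made here for illustration only; the text does not specify the interpolation function, and the name `morph_partials` is hypothetical.

```python
import numpy as np

def morph_partials(mags_a, mags_b, mix):
    """Interpolate partial magnitudes between tone A (mix=0) and tone B (mix=1).

    Partials are paired by position; the shorter list is padded with
    zero-magnitude partials so that every partial has a partner.
    """
    n = max(len(mags_a), len(mags_b))
    a = np.zeros(n)
    b = np.zeros(n)
    a[: len(mags_a)] = mags_a
    b[: len(mags_b)] = mags_b
    return (1.0 - mix) * a + mix * b

# Tone A has three partials, tone B only two; a 50% morph.
mongrel = morph_partials([1.0, 0.5, 0.25], [0.8, 0.2], mix=0.5)
```

With `mix=0.5` the third partial of tone A is halved, because its partner in tone B has zero magnitude.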
3.1 Fundamental and Harmonic Identification
The SPWD can be interpreted as a distribution of a signal's energy in time and frequency. Just as the STFT contains leakage due to the windowing process involved in generating the FFT, so will the SPWD, although the leakage will be less pronounced. The first important step in the morphing process is to identify which partials are going to be paired for morphing. The accuracy of the spectrum clearly depends on the number of samples analysed. Nevertheless, when the spectrum is analysed to locate the partials - fundamental and harmonics - it should be possible to locate them regardless of the FFT length. We can calculate the frequency of any component in the FFT output using
Fk = (k / N) * Fs        (1)
where Fk is the frequency value at index or sample number
k, N is the length of the spectrum, and Fs is the sampling frequency.
However, we are not interested in just any frequency, but in the harmonic
component frequencies, assuming we are dealing with harmonic spectra. To
locate these frequencies, we need to scan the spectrum for peaks at every
temporal location in the time-frequency representation, because the
harmonics vary in both amplitude and frequency location over time. To speed
up this process, we estimate the separation of the harmonics, in other
words, an index corresponding to the frequency of the fundamental. This
index is our margin of error, or 'error index'. To speed up the search
further, the algorithm jumps forward a few samples in the spectrum if no
significant magnitudes are found after a few trials, so that not every
value in the spectrum has to be analysed individually. This 'skip' amount
varies relative to the length of the sample and the sampling frequency.
This harmonic search algorithm has been found to work successfully and
efficiently. However, the use of the SPWD rather than the STFT allows for
more accurate partial identification, as can be seen in Figure 2.
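A harmonic search of this general kind can be sketched for a single spectral slice as shown below. This is one possible reading of the description rather than the algorithm used in Mongrel: the search jumps from one expected harmonic position to the next and looks for a peak within a small tolerance around each. The tolerance `tol_bins`, the significance threshold `rel_threshold` and the function names are assumptions.

```python
import numpy as np

def bin_frequency(k, n_fft, fs):
    """Equation (1): frequency in Hz of spectrum index k."""
    return (k / n_fft) * fs

def find_harmonics(spectrum, f0_bin, tol_bins=2, rel_threshold=0.05):
    """Return (harmonic number, peak bin, magnitude) for one spectral slice.

    f0_bin is the integer index estimated for the fundamental. Only bins
    within tol_bins of each multiple of f0_bin are examined, so the bins
    in between are skipped.
    """
    floor = rel_threshold * spectrum.max()
    found = []
    h = 1
    while h * f0_bin + tol_bins < len(spectrum):
        centre = h * f0_bin
        lo = max(centre - tol_bins, 0)
        hi = centre + tol_bins + 1
        peak = lo + int(np.argmax(spectrum[lo:hi]))
        # Ignore positions where nothing significant is present.
        if spectrum[peak] > 0 and spectrum[peak] >= floor:
            found.append((h, peak, float(spectrum[peak])))
        h += 1
    return found

# Synthetic slice with partials close to multiples of bin 20.
slice_mag = np.zeros(200)
slice_mag[[20, 41, 59, 80]] = [1.0, 0.5, 0.3, 0.01]
partials = find_harmonics(slice_mag, f0_bin=20)
```

In a full analysis the function would be called once per temporal location, since the peak bin of each harmonic can drift from slice to slice.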
3.2 Onset/Decay Identification
After locating the spectral features of the signal, the next step
before morphing can be effected is to locate the temporal features (onset
and decay). Using the STFT, this analysis can be inaccurate,
due to the intrinsic temporal spreading. By replacing the STFT with the
SPWD, this problem is eased, and accurate temporal identification becomes
possible. In this analysis, each harmonic must be analysed
individually to locate its salient features. When the magnitude of the
harmonic rises beyond 10% of its steady state value, the attack, or onset,
is considered to have begun, ending at 90% of the steady state magnitude.
Conversely, when the harmonic drops to 90% of its steady state magnitude,
it is deemed to have begun to decay. It must also be considered that the
harmonic is prone to 'frequency wobble', a phenomenon that cannot be
described as 'vibrato' because it is not as audible, yet is still present.
This can be easily compensated for by analysing slightly above and below
the nominal frequency location - a quarter tone variance, for example - at
each temporal analysis point.
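The 10% and 90% thresholds, together with the allowance for frequency wobble, can be sketched as follows. The code assumes a magnitude array indexed as (time, frequency), uses the peak of the envelope as a stand-in for the steady state magnitude, and expresses the wobble allowance as a number of bins; these choices and the function names are assumptions, not details given in the text.

```python
import numpy as np

def harmonic_envelope(tfr, bin_index, wobble_bins=1):
    """Follow one harmonic through a (time, frequency) magnitude array.

    At each time step the largest value within +/- wobble_bins of
    bin_index is taken, to allow for small frequency deviations.
    """
    lo = max(bin_index - wobble_bins, 0)
    hi = bin_index + wobble_bins + 1
    return tfr[:, lo:hi].max(axis=1)

def onset_decay_points(envelope):
    """Return frame indices (onset_start, onset_end, decay_start).

    decay_start equals len(envelope) if the harmonic never falls
    below 90% of the steady state level.
    """
    steady = envelope.max()  # assumption: peak level approximates steady state
    onset_start = int(np.argmax(envelope > 0.1 * steady))
    onset_end = int(np.argmax(envelope >= 0.9 * steady))
    above = np.nonzero(envelope >= 0.9 * steady)[0]
    decay_start = int(above[-1]) + 1
    return onset_start, onset_end, decay_start

# Synthetic harmonic that wobbles into the neighbouring bin at frame 5.
env = np.array([0.0, 0.05, 0.2, 0.5, 0.95, 1.0, 1.0, 1.0, 0.95, 0.7, 0.3, 0.0])
tfr = np.zeros((12, 5))
tfr[:, 2] = env
tfr[5, 2] = 0.0
tfr[5, 3] = 1.0
points = onset_decay_points(harmonic_envelope(tfr, bin_index=2))
```

Without the wobble allowance, frame 5 would read as a drop to zero and the decay would be reported far too early.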
3.3 Amplitude Identification
To correctly identify the amplitude of the signal, the effects of sampling
upon the magnitude representation of the signal must be considered. As
is well documented, sampling potentially introduces an error due to the
creation of frequency bins. This error shows up in the form of a frequency
spread that can become rather dramatic as the critical midpoint of the
FFT bin is reached. Analysis shows that this error is in fact proportional
to the distance from the edge of the FFT bin. In fact, by setting up an
analysis program, the correct amplitude of any spectral value can be estimated
simply by using the relationship between leakage and proximity to the bin
edges.
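The paper does not state the exact relationship it uses, so the sketch below substitutes a standard two-bin interpolation for a rectangular window as an illustration of the idea. It reads a 'bin' as the interval between two adjacent analysis frequencies, so that the largest loss occurs at its midpoint. The ratio of the peak bin to its larger neighbour gives the fractional offset of the true frequency, and the known sinc-shaped main lobe then gives the amplitude loss to undo. The function name and test signal are assumptions.

```python
import numpy as np

def corrected_amplitude(x):
    """Estimate the amplitude of the strongest sinusoid in x.

    Returns (raw, corrected). Assumes a rectangular window, a single
    dominant real sinusoid, and a frequency well away from 0 and Nyquist.
    """
    n = len(x)
    mag = np.abs(np.fft.rfft(x))
    k = int(np.argmax(mag[1:-1])) + 1           # strongest interior bin
    neighbour = max(mag[k - 1], mag[k + 1])
    # Fractional distance (0 to 0.5 bins) of the true frequency from bin k.
    offset = neighbour / (mag[k] + neighbour)
    raw = 2.0 * mag[k] / n
    # np.sinc(d) = sin(pi d) / (pi d) is the main-lobe loss at offset d.
    return raw, raw / np.sinc(offset)

# Unit-amplitude cosine lying 0.3 bins above bin 100 of a 1024-point FFT.
n = 1024
x = np.cos(2 * np.pi * 100.3 * np.arange(n) / n)
raw, corrected = corrected_amplitude(x)
```

For this signal the raw peak reads roughly 14% low, while the corrected value lies close to the true amplitude of 1. A different analysis window would need its own main-lobe shape in place of the sinc function.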
Above in Figures 3.1 and 3.2, we can see two tones and a mongrel tone
which contains 50% of each of the two original tones. The morph was created
using the ideas that have been laid out in this paper. Looking closely
at the mongrel tone, it can be seen that the purity of the clarinet
sound has blended into the vibrato and jaggedness of the violin
tone. It can also be seen that the magnitude of the mongrel tone is approximately
halfway between the two parent tones for all temporal and spectral content.
It must be remembered however, that what can be seen as alterations are
not always the most perceptually identifiable changes, so to ensure that
the morph is in fact approaching a compromise between the two parent tones,
sample tests on listeners will have to be carried out.
4 Conclusions
As mentioned in Section 3.3, an intrinsic problem with the FFT is that
we are dealing with sampled signals, which introduces an error due to the
creation of frequency bins. By using the SPWD instead of the STFT, this
error can be reduced, allowing for improved spectral analysis. Similarly,
as the SPWD does not cause the temporal smearing of the STFT, the dynamic
attributes which have been shown to be perceptually significant can be
identified more accurately. As a result of the use of more appropriate
joint time-frequency (JTF) distributions combined with an analysis of leakage,
solutions to the problems of identifying individual frequencies in any
given signal, and also identifying the onset and decay locations of these
frequencies can be better approached. For the purposes of musical synthesis
deriving from timbre morphing, the improved accuracy of the newly discussed
methods will allow for more accurate extraction and manipulation of those
features which characterise musical timbre.
5 References
[1] M. Slaney, M. Covell and B. Lassiter, "Automatic Audio Morphing,"
in Proc. ICASSP, IEEE, Atlanta, GA, May 7-10, 1996.
[2] B. Walker and K. Fitz, CERL Sound Group, "Lemur,"
University of Illinois, Urbana, IL 61801, USA.
[3] B. Holloway, E. Tellman and L. Haken, "Timbre Morphing of Sounds
with Unequal Numbers of Features," J. Audio Eng. Soc., vol. 43, no.
9, pp. 678-689, September 1995.
[4] D.J. Furlong and C.J. Hope, "Time-Frequency Distributions for Timbre
Morphing: The Wigner Distribution versus the STFT," Proc. SBCMIV, Brasilia,
Brasil, August 1997.