What is a sound? What does it look like?
Sound is a mechanical wave generated by a vibrating object and carried through a medium (most often air) from one location to another. We use a tuning fork as an example to show how a sound is generated. When the tines of the tuning fork vibrate, they disturb the surrounding air molecules. Through particle interactions, the disturbance is passed on to adjacent air molecules. This travelling disturbance is referred to as a sound wave.

Figure 1. A tuning fork and its vibration.
As the tines push forward, the air molecules are compressed together. As the tines move back, the air molecules spread out. Because of the repeating motion of the tines, the air contains two distinct kinds of regions: compressions and rarefactions. Compressions are regions of high air pressure, while rarefactions are regions of low air pressure.

Figure 2. The compression and rarefaction regions of a sound wave.
If we plot the air pressure on the y-axis against time on the x-axis, we obtain the familiar picture of a sound wave.

Figure 3. The pressure of a sound wave.
Now we know what a sound is in theory. Let's try something hands-on to make the theory more convincing and easier to understand. First, plug a microphone into the sound card of a computer and open the Windows application called Sound Recorder. Push the red circle button to start recording, then speak into the microphone or sing a song you like. Finally, push the square button to stop recording. You will see the wave of your recorded sound, similar to the following figure.
Figure 4. Sound wave in Microsoft sound recorder application.
What should we do if we want to "see" the sound of songs from a favorite CD? We can use software such as windac (http://www.windac.de/) or nero (http://www.nero.com/us/index.html) to copy a song from the CD (CD audio format) into a wave file (WAV format) on the computer. Both formats are digital; the wave file simply stores the samples in a form that audio editors can open. Then we can use cool editor (http://www.adobe.com/special/products/audition/syntrillium.html) to open the wave file and see the features of the sound.
Figure 5. Sound wave in cool editor software.
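The samples in a wave file can also be inspected directly. The following is a minimal sketch using only the Python standard library; the file name, the 440 Hz test tone, and the 8000 Hz sample rate are illustrative assumptions, and the generated tone stands in for a real recording.

```python
import math
import os
import struct
import tempfile
import wave

def write_tone(path, freq_hz=440.0, n_samples=800, sample_rate=8000):
    """Write a mono 16-bit WAV file holding a sine tone (a stand-in for a recording)."""
    samples = [
        int(20000 * math.sin(2 * math.pi * freq_hz * t / sample_rate))
        for t in range(n_samples)
    ]
    with wave.open(path, "wb") as w:
        w.setnchannels(1)           # mono
        w.setsampwidth(2)           # 2 bytes = 16 bits per sample
        w.setframerate(sample_rate)
        w.writeframes(struct.pack("<%dh" % n_samples, *samples))

def read_samples(path):
    """Return (sample_rate, samples) for a mono 16-bit WAV file."""
    with wave.open(path, "rb") as w:
        if w.getnchannels() != 1 or w.getsampwidth() != 2:
            raise ValueError("this sketch only handles mono 16-bit files")
        n = w.getnframes()
        data = w.readframes(n)
        return w.getframerate(), list(struct.unpack("<%dh" % n, data))

# Hypothetical file name, placed in the system temporary directory.
path = os.path.join(tempfile.gettempdir(), "tone_example.wav")
write_tone(path)
rate, samples = read_samples(path)
print(rate, len(samples), samples[:5])
```

Plotting the returned samples against time gives the same kind of picture that a sound editor displays.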
What is a pitch? How can we analyze a sound wave?
The frequency of a wave refers to how often the particles of the medium vibrate when a wave passes through the medium. It is measured as the number of complete back-and-forth vibrations of a particle of the medium per unit of time.
1 Hertz = 1 vibration/second
Figure 6. Sound waves with different frequencies.
The sensation of a frequency is commonly referred to as the pitch of a sound. A high-pitched sound corresponds to a high frequency and a low-pitched sound corresponds to a low frequency.
Let's hear some examples of pitches corresponding to different frequencies.

| Pitch | Frequency (Hz) |
|-------|----------------|
| C4    | 261            |
| C5    | 523            |
| C#5   | 554            |
| E2    | 82             |
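The pitch names and frequencies listed above can be reproduced with a short calculation. The sketch below assumes twelve-tone equal temperament with A4 tuned to 440 Hz, which the text does not state explicitly; the function name `note_frequency` is illustrative.

```python
# Semitone offsets of the natural notes within an octave, counted from C.
SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def note_frequency(name, a4=440.0):
    """Return the equal-temperament frequency in Hz of a note such as 'C#5'."""
    letter, rest = name[0], name[1:]
    sharp = rest.startswith("#")
    octave = int(rest[1:] if sharp else rest)
    # MIDI-style note number: C4 is 60 and A4 is 69.
    number = 12 * (octave + 1) + SEMITONES[letter] + (1 if sharp else 0)
    # Each semitone multiplies the frequency by the twelfth root of two.
    return a4 * 2 ** ((number - 69) / 12)

for name in ["C4", "C5", "C#5", "E2"]:
    print(name, round(note_frequency(name), 1))
```

Under these assumptions each computed value falls within 1 Hz of the frequency listed for that pitch.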
The relationships between musical intervals and frequency ratios are given in the following table:

| Interval    | Frequency Ratio | Example                 |
|-------------|-----------------|-------------------------|
| Octave      | 2:1             | 512 Hz and 256 Hz       |
| Major third | 5:4             | 320 Hz and 256 Hz       |
| Fourth      | 4:3             | about 341 Hz and 256 Hz |
| Fifth       | 3:2             | 384 Hz and 256 Hz       |
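As a quick check of the interval ratios, the upper note of each interval can be computed from the 256 Hz lower note used in the examples. This is a minimal sketch; the dictionary and function names are illustrative.

```python
from fractions import Fraction

# Frequency ratios of the intervals, upper note to lower note.
RATIOS = {
    "octave": Fraction(2, 1),
    "third": Fraction(5, 4),
    "fourth": Fraction(4, 3),
    "fifth": Fraction(3, 2),
}

def upper_frequency(lower_hz, interval):
    """Return the frequency in Hz of the note one `interval` above `lower_hz`."""
    return float(lower_hz * RATIOS[interval])

for name in RATIOS:
    print(name, round(upper_frequency(256, name), 1))
```

The fourth is the only interval here that does not land on a whole number of hertz: 256 × 4/3 is about 341.3 Hz.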
The map of guitar frequency and pitch is shown at http://www5.ocn.ne.jp/~dgb/fret/Freq_gte.htm
How can we extract notes from a sound wave?
The Fast Fourier Transform (FFT) is an algorithm that transforms a sound wave from the time domain into the frequency domain. For a simple explanation, we can think of it as a mathematical tool that tells you which frequencies a sound wave contains. For example, Figure 7 shows a continuous signal together with the result of its FFT, which is the frequency of the signal.
Figure 7. A waveform and the result of its Fast Fourier Transform
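To make the idea concrete, the sketch below evaluates the discrete Fourier transform directly from its definition on a synthetic 256 Hz tone and reports the strongest frequency. The FFT computes the same values, only faster. The sample rate, the signal length, and the function names are assumptions chosen for illustration.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitude of each DFT bin from 0 up to (not including) the Nyquist bin."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(samples)))
        for k in range(n // 2)
    ]

def dominant_frequency(samples, sample_rate):
    """Return the frequency in Hz of the strongest bin, ignoring the DC bin."""
    mags = dft_magnitudes(samples)
    peak_bin = max(range(1, len(mags)), key=mags.__getitem__)
    # Bin k corresponds to k * sample_rate / n hertz.
    return peak_bin * sample_rate / len(samples)

SAMPLE_RATE = 1024  # samples per second (assumed)
N = 256             # a quarter of a second of signal
tone = [math.sin(2 * math.pi * 256 * t / SAMPLE_RATE) for t in range(N)]
print(dominant_frequency(tone, SAMPLE_RATE))  # prints 256.0
```

With these settings the bins are 4 Hz apart, so the 256 Hz tone falls exactly on bin 64.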
Once we know the frequency, we can obtain the pitch of the sound wave. To extract each note from a sound wave, in theory we cut the sound wave into small pieces and apply the Fast Fourier Transform to each piece. If the pieces are small enough that each contains a single note, we can obtain every note of the sound.
Example: non-overlapped notes
G3 - C4 - E4 - G4 - C5 - E5
Figure 8. Sound wave of input audio sequence G3 - C4 - E4 - G4 - C5 - E5
Output frequencies (Hz):
196 - 261 - 329 - 392 - 523 - 659
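The piece-by-piece analysis can be sketched on a synthetic version of this sequence. The code below builds six pure tones at the listed frequencies, cuts the signal into equal pieces, and runs a small recursive radix-2 FFT on each piece. The sample rate, the piece length, and the pure-sine input are assumptions for illustration; a real recording would be much less clean.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def peak_frequency(piece, sample_rate):
    """Return the frequency in Hz of the strongest FFT bin below Nyquist."""
    n = len(piece)
    mags = [abs(c) for c in fft(piece)[: n // 2]]
    k = max(range(1, n // 2), key=mags.__getitem__)
    return k * sample_rate / n

SAMPLE_RATE = 4096  # samples per second (assumed)
PIECE = 1024        # samples per piece, i.e. 0.25 s; bins are 4 Hz apart
NOTES_HZ = [196, 261, 329, 392, 523, 659]  # G3 C4 E4 G4 C5 E5

# One pure tone per note, played one after another without overlap.
signal = []
for f in NOTES_HZ:
    signal += [math.sin(2 * math.pi * f * t / SAMPLE_RATE) for t in range(PIECE)]

detected = [
    peak_frequency(signal[i:i + PIECE], SAMPLE_RATE)
    for i in range(0, len(signal), PIECE)
]
print(detected)
```

Each detected value lies within one bin width (4 Hz) of the true note frequency. Longer pieces give finer frequency resolution but blur the timing of the notes.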
What makes things so difficult?
Music rarely consists of sound waves of a single repeated frequency. Few people would be impressed by an orchestra in which every instrument plays notes of a single tone. Rather, instruments produce overtones, resulting in a sound that consists of multiple frequencies. Such instruments are described as being rich in tone color. Music is a mixture of sound waves whose frequencies typically stand in whole-number ratios to one another.
Our Solution: Pre-emphasis and Overtone Removal
We use overlapping sliding windows in our pitch detection algorithm. Each window has a duration of 0.25 second, and each window overlaps its neighbor by half a window. We use the pre-emphasis technique, which is very common in speech recognition, to reduce the effect of residual notes from the previous window on the current one. The pre-emphasis method works as follows: for each window, we subtract 0.97 times the FFT result of the previous window from the FFT result of the current window. As a result, new notes in the later window stand out, and notes carried over from the previous window are removed. We also define an amplitude threshold: only a pitch whose amplitude is above the threshold is considered a note.
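The windowing and pre-emphasis steps might be sketched as follows. This is an illustration under assumptions, not the exact implementation: it assumes the subtraction is applied to the magnitude of each frequency bin, that the previous window's unmodified spectrum is the one subtracted, and that the threshold value is chosen by hand. The function names and the tiny hand-made spectra are hypothetical.

```python
def window_starts(total_samples, window, hop):
    """Start indices of windows of `window` samples that advance by `hop` samples."""
    return list(range(0, total_samples - window + 1, hop))

def new_note_bins(spectra, alpha=0.97, threshold=1.0):
    """For each window, list the bins that stay above `threshold` after
    subtracting `alpha` times the previous window's magnitude spectrum."""
    result = []
    previous = [0.0] * len(spectra[0])  # nothing precedes the first window
    for current in spectra:
        emphasized = [c - alpha * p for c, p in zip(current, previous)]
        result.append([k for k, v in enumerate(emphasized) if v > threshold])
        previous = current
    return result

# 0.25 s windows at an assumed 4096 Hz sample rate, overlapping by half.
print(window_starts(2048, window=1024, hop=512))  # [0, 512, 1024]

# Hand-made magnitude spectra with four bins per window:
# a note in bin 1 sounds first, a note in bin 2 joins, then bin 1 stops.
spectra = [
    [0.0, 10.0, 0.0, 0.0],
    [0.0, 10.0, 8.0, 0.0],
    [0.0, 0.0, 8.0, 0.0],
]
print(new_note_bins(spectra))  # [[1], [2], []]
```

In the second window the sustained note in bin 1 is almost cancelled (10 - 9.7 = 0.3, below the threshold), so only the newly started note in bin 2 is reported.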
To manage overtones, we simply treat a later note as an overtone if its frequency is an integer multiple of the previous note's frequency. For example, if the first note is middle C (C4) at 256 Hz and the second note is C5 at 512 Hz, then we consider C5 an overtone of C4 and ignore it in our score interpretation.
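The overtone rule can be sketched as a small filter over the detected frequencies. Two details below are assumptions rather than part of the description above: each note is compared with the most recently kept note, and a small relative tolerance is allowed because measured frequencies are rarely exact multiples. The function name is illustrative.

```python
def remove_overtones(frequencies, tolerance=0.03):
    """Drop each note whose frequency is close to an integer multiple
    (2x, 3x, ...) of the most recently kept note."""
    kept = []
    for f in frequencies:
        if kept:
            ratio = f / kept[-1]
            nearest = round(ratio)
            # The tolerance is relative to the nearest integer multiple.
            if nearest >= 2 and abs(ratio - nearest) <= tolerance * nearest:
                continue  # treat as an overtone and ignore it
        kept.append(f)
    return kept

print(remove_overtones([256, 512, 384]))  # [256, 384]
print(remove_overtones([261, 523]))       # [261]
```

In the first call 512 Hz is dropped as twice 256 Hz, while 384 Hz (a ratio of 1.5) is kept. This simple rule also discards a genuinely played octave, which is the trade-off of treating every integer multiple as an overtone.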
* Figures 1, 2, 3, and 6 are obtained from The Physics Classroom, http://www.physicsclassroom.com/mmedia/waves/edl.html