Bsrgvty

Question

Assume Bob. Bob wants to convey the following, simple message (an example) to Cassandra:

Walk 5 feet forward, turn around 90 degrees clockwise, walk another 4 feet and dig 3 feet into the ground.

Easy, right? Now here's the twist: I require Bob not to state the message outright, but to somehow hide or convey it within other vocal expressions. This could be him talking or singing. Bob and Cassandra had the prior opportunity to agree on a code scheme, and that's what I am after.

There should not be any link to the hidden message within the words of Bob's utterance. So something like "use every first word of every second sentence" is not viable. The meaning of the actual spoken/uttered words cannot play any role within the scheme.

They do not know beforehand whether a song or talking will be used, so both modes need to be viable. Bonus points if even random screaming could be used.

The scheme should allow for an almost mathematical precision. There should be no doubt if Bob meant 3 or 4 feet.

Assume that Cassandra, the recipient, can hear the message clearly. Audio transfer is not my point; I am only looking for an encoding scheme.

I imagine that some parameters of the human voice, or of sound waves in general, could be used. I am unsure, however, which one. Volume shouldn't have any meaning, so amplitude is out, right? Frequency?

Ease of use is not a primary concern. If both need to be geniuses and have absolute pitch for your idea to work, so be it. If Cassandra needs to know "oh, boy, that's 120 Hz right now", so be it.

Given my requirements, goals and constraints, is there a way to use some acoustic property of the human voice as an "additional channel" to convey a second (hidden) message? How would such a mapping work?

Number of syllables sounds nice! I will think it through. Care to expand it into an answer? I also expanded my constraints thanks to your comment. — user6415, Sep 10 at 22:28
How do you determine "best answer" here? (Too interesting to shut down as idea generation...) — Sep 11 at 5:51
♫And I would walk 500 feet, turn right and walk 400 more, and I would dig 300 feet...♫ (The trick is knowing to divide everything by 100...) — Sep 11 at 13:24
Why is volume ruled out? Amplitude modulation in real life uses the "volume" of a radio signal just fine. It's not the absolute amplitude, but the change in amplitude, that carries the information. — Sep 12 at 2:18

Halfthawed · Accepted Answer · 2019-09-10 22:43:53Z

Morse code and syllable length - that is, use syllable length to encode a message in Morse code, so a long syllable for a dash and a short one for a dot. Easy to decode if intercepted - yes, absolutely. But you didn't mention the possibility that anyone was listening in to find a hidden message, just that it needed to be encoded.

Without any training, it'll be slightly noticeable to be sure, though an excuse like 'My cadence varies when I'm nervous' could help. With training, though, you'll be able to keep the dash and dot syllables only slightly varied from true syllables, and thus perfectly viable.

Reminds me of the military operation in which the Colombian army sent a hidden message to hostages… using a pop song — Sep 12 at 6:10
How about Morse code with swear words? Anything with 4 letters is a dot, anything longer is a dash. Relatively easy to fit one letter into each normal sentence without having to think too hard, easy to decode and almost indistinguishable from a normal soldier's conversation. — Sep 12 at 14:05
@RobinBennett That could work if you don't mind swearing like a sailor. — Sep 12 at 17:11
Assuming the numbers are spelled out and commas omitted, but spaces need to be included, you need to encode 117 characters. Assuming you have some specific way of indicating a space between words, for Morse there will be 376 individual "characters" that have to be transmitted; dots, dashes, the space between letters, and the spaces between words. — Sep 12 at 20:22

score 16 · Accepted Answer · 2019-09-11 07:37:13Z

You can scream if you like, because it's not the sounds that matter, it's the silences between them.

The gaps between the words are what counts, whether you choose to encode in Morse or otherwise. The cadence of speech includes the gaps as well as the sounds and a bit of careful timing will allow you to pass the message.

The major downside of anything related to Morse code is that you have a very low information density. There's going to be a lot of screaming for you to get your message across.

Guran · Accepted Answer · 2019-09-11 05:49:40Z

Breath and word count

The simplest form of encoding I can come up with is this:

Whether Bob speaks or sings, pay attention to when he breathes. Count the number of words between each breath. It will be a number between one and eight. (If not, it is meaningless noise, a filler.)

By combining two such numbers the scheme allows for 64 characters, more than enough for A-Z, numbers and space.

Granted, this will be very hard to encode/decode on the fly, but given some minimal preparation Bob can easily disguise any message in speech or song.
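The breath-count scheme can be sketched in a few lines. This is only an illustration: the answer says that two counts give 64 combinations but does not say which combination means which character, so the `ALPHABET` ordering below is an assumption, as are the function names.

```python
import string

# Assumed symbol table: the answer only states that 64 combinations exist,
# so this particular ordering (A-Z, 0-9, space) is a hypothetical choice.
ALPHABET = string.ascii_uppercase + string.digits + " "  # 37 of 64 slots used


def encode(message):
    """Return the word counts Bob must speak between breaths."""
    counts = []
    for ch in message.upper():
        index = ALPHABET.index(ch)      # position 0..36 in the table
        high, low = divmod(index, 8)    # split into two base-8 digits
        counts += [high + 1, low + 1]   # shift each digit into the 1..8 range
    return counts


def decode(counts):
    """Rebuild the message; counts outside 1..8 are treated as filler."""
    valid = [c for c in counts if 1 <= c <= 8]
    chars = []
    for high, low in zip(valid[0::2], valid[1::2]):
        index = (high - 1) * 8 + (low - 1)
        chars.append(ALPHABET[index] if index < len(ALPHABET) else "?")
    return "".join(chars)
```

With this table, "DIG 3" becomes ten breath groups, and a group of twelve words slipped in between is simply skipped by the decoder.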

celtschk · Accepted Answer · 2019-09-11 06:46:14Z

You could hide the message in the word order where grammar allows it (obviously, the more freely the language lets you reorder words, the better this works).

For example, consider the sentence:

Today I'll have pizza for lunch.

You can move "today" and "for lunch" to many different positions:

I'll have pizza today for lunch.

I'll have pizza for lunch today.

For lunch I'll have pizza today.

For lunch today, I'll have pizza.

Today for lunch, I'll have pizza.

So you can encode a digit between 1 and 6 in the word order. Note that the hidden message is independent of the obvious message; the very same digit can be hidden in sentences like

Yesterday I watched Doctor Who after work.

Sometimes I go swimming in the morning.

Now clearly there are sentences with different numbers of possible word orders. But it should be possible to come up with a code that works with arbitrary sentences (except that sentences with fixed word order won't be able to give any information).

A possible coding strategy could be as follows:

For each sentence, determine the number of possible word orders; let's call it the sentence capacity. For example, the example sentence above would have a capacity of 6. Then take as many sentences as are needed for the product of their capacities to reach at least 27 (enough to encode 26 letters and a space). These sentences form a code group.

Next, assign each sentence of the code group a number by lexicographically ordering the possible sentences and numbering them starting at zero. The example sentence is last in the order, therefore it would get the number 5.

Then, calculate the value of the code group by multiplying the number of each sentence by the capacities of all following sentences, and then adding it all together.

If the resulting value is zero, it is a space, if it is between 1 and 26, it describes a letter, and if it is larger, then the encoder made an error.

For example, consider the following text:

I'll have Pizza for lunch today. I bought it this morning. For dessert I plan to eat strawberries or cherries.

The first sentence has a capacity of 6, the second has a capacity of 2, the third has a capacity of 4 ("for dessert" can be put at the beginning or end, and also the order of strawberries and cherries can be flipped without changing the meaning).

The product of the capacities is 6×2×4=48, clearly larger than 27, but the first two sentences only give 12, so the code group consists of those three sentences.

The first sentence has two other possible orders preceding it in lexicographical order (the two variants starting with "for lunch"), so it gets the value 2. The second sentence is first in the list of possible word orders, so it gets the value 0. And the third sentence has only the one with strawberries and cherries switched preceding it lexicographically, thus it gets the value 1.

Thus the value of the code group is 2×2×4 + 0×4 + 1 = 17, which corresponds to a Q.
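The code group calculation is a mixed-radix number, which can be checked with a short sketch. The function names are hypothetical; the sentence numbers and capacities are the ones from the worked example.

```python
def group_value(numbers, capacities):
    """Mixed-radix value of a code group.

    Each sentence number ends up multiplied by the capacities of all
    following sentences, which is what Horner's scheme computes here.
    """
    value = 0
    for number, capacity in zip(numbers, capacities):
        if not 0 <= number < capacity:
            raise ValueError("sentence number must be below its capacity")
        value = value * capacity + number
    return value


def to_symbol(value):
    """0 is a space, 1..26 are A..Z, anything larger is an encoder error."""
    if value == 0:
        return " "
    if 1 <= value <= 26:
        return chr(ord("A") + value - 1)
    raise ValueError("value above 26: the encoder made an error")


# Worked example: sentence numbers 2, 0, 1 with capacities 6, 2, 4.
value = group_value([2, 0, 1], [6, 2, 4])
symbol = to_symbol(value)
```

Here `value` comes out as 17 and `symbol` as "Q", matching the hand calculation.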

Dan Hanson · Accepted Answer · 2019-09-11 02:06:42Z

Steganography is the encoding and decoding of information hidden in plain sight within pictures, audio files, whatever. It's much easier if you allow hardware in the mix - for example, encoding a message in the noise below the audible level, or adjusting the frequency of each tone by just enough that the difference can be measured but not heard.

But you want to be able to do this by just singing or talking. For singing, one possible encoding scheme for a very good singer would be vibrato. You could send numerically encoded messages by controlling how many 'beats' of vibrato you use for each phrase.

But you want to be able to use talking as well, and the encoding can't be in the words according to your criteria. So that leaves things like pitch, duration, volume, and non-word sounds like breath intakes, duration between words, 'umm's and 'aww's, etc. Skip volume, as it's too hard to determine absolute volume and you probably want it to work for varying levels of background noise.

For more complexity add them together. For example, taking in a breath then saying, 'um, we should go' could mean something completely different than just saying 'we should go', which could be different than sighing then saying the same thing. It's not the 'we should go' that matters, it's the patterns of speech around the words.

So, 'breath intake + um + rising tone at end of sentence' (as in a question) means one thing. 'breath exhale + 2ums in sentence + flat tone' means something else. Make up as many different combinations as you need to encode all the information.

The nice thing about encoding your message in the 'metadata' of talking instead of words or syllable lengths or something tied to specific words is that you could make it work with any text. What matters is not the text itself, but how you say it or sing it. You could read the phone book this way and still get your message across.

Innovine · Accepted Answer · 2019-09-11 07:06:59Z

Something like Jeremiah Denton? As a POW, in a televised interview he blinked t-o-r-t-u-r-e in Morse code.
https://youtu.be/rufnWLVQcKg

Why I downvoted (no offense): While this is an interesting bit of information, it completely ignores all constraints and requirements. – user6415, Sep 11 at 12:20

MichaelS · Accepted Answer · 2019-09-12 08:40:19Z

Summary

You could use amplitude or frequency modulation of the voice to transfer data at between 5 WPM (realistic maximum) and 20 WPM (probably superhuman). You could go arbitrarily high with synthetics or cybernetic implants (up to around 5 billion WPM).

Essentially, it involves slightly (or greatly, if stealth isn't an issue) raising and lowering your volume or pitch with reasonably accurate timing. These changes are interpreted by your partner as binary data, which can encode text or other information using a variety of formats.

A 5-bit, 32-character ASCII-like code is probably the fastest option, with Morse code being a little slower, but less prone to errors.

You could also use AM and FM at the same time (a combination in the same spirit as quadrature amplitude modulation, or QAM⁰, which varies amplitude and phase together), but that would be extremely difficult for normal people to pull off, and I haven't discussed it here.

Amplitude and Frequency Modulation (AM and FM)

The obvious aspects of the human voice that translate into basic radio transmission schemes are amplitude and frequency. By rapidly changing either property you can encode lower-frequency sound waves in the higher-frequency carrier wave.

[Figure: animation of a simple base signal modulated with amplitude and frequency modulation. AM / FM examples.¹]

AM / FM Sound Waves

Instead of translating to radio waves and back, you can simply alter the amplitude or frequency of sound at very high speeds (at a rate of about twice the maximum frequency you're representing). This might be possible for a synthetic, but would likely be impossible for a normal human in real time.

However, you can always encode the signal in non-real time. I'm not finding any data on the fastest larynx variations humans are capable of, but I'd guess it's no greater than 10 Hz, and probably less than that. Various studies have shown large muscles can twitch in 100 to 300 ms², which roughly translates to 10 to 3 Hz, respectively.

One second of a 200 Hz signal would then take 40 or more seconds to encode. It would also likely be quite difficult to encode and decode.

On-Off and Frequency-Shift Keying

On-off keying³ is a simple way to modulate binary data using extreme amplitude shifts. The presence of noise represents logic high, while the absence of noise represents logic low. We can extend this by using two different, but present, amplitudes, or by using two different frequencies, which is known as frequency-shift keying⁴.

[Figure: binary frequency-shift keying.⁵]
[Figure: binary amplitude-shift keying.⁶]

Encoding a Message in Binary

At this point, you have a simple, binary alphabet. You can use this alphabet to encode any kind of data you want. You can have fixed-length words that represent specific letters in a traditional alphabet (ASCII⁷ or Unicode⁸), variable-length words representing an intermediate set of symbols (Morse code⁹), binary data representing sound levels or image information, etc.

The biggest problem here is just the human factor. The more complex your information, the harder it is to feasibly encode and decode it. At some point, there's a physical limit. Your best case is likely to be something like ASCII, reduced to 5 bits, or 32 characters. At 2.5 Hz, each character takes two seconds.

Binary-encoded Morse code (BEMC) could also be used, but it takes about 6.2 bits per character (25% more).

_{(I wrote a simple C program¹⁰ to convert an input string to BEMC. To test normal-ish English, I picked a random Wikipedia article, got the article for "Lake Chub"¹¹, stripped newlines from the text, and used the contents as input for the program. The program processes alpha and numeric characters, along with spaces. Input consisted of 4972 characters, of which 4739 were processed and converted to 29690 bits. On average, this encoding used 6.27 bits per character. Processing only alpha characters (to compare to 5-bit alpha-only encoding), 4696 were processed and converted to 29024 bits, which used 6.18 bits per character.)}

The advantage of BEMC is that every dash and dot has both a high-to-low and a low-to-high transition for every character, so it's relatively easy to keep track of timing.

Technically, you have to mentally encode twice (once from alphabet to Morse, then again from Morse to binary), but in practice there's little distinction between BEMC and just using ternary (dash, dot, space) Morse code directly -- dashes are long periods of logic high with a short period of logic low, dots are short periods of logic high with a short period of logic low, and spaces are medium periods of logic low.
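A minimal sketch of a binary-encoded Morse converter follows. It is not the C program referenced above, whose exact bit patterns are not given here. The patterns `DOT`, `DASH`, `LETTER_GAP` and `WORD_GAP` are assumptions chosen to fit the description (short or long high followed by a short low, with longer lows as separators), so the bits-per-character figure it produces need not match the 6.2 quoted earlier.

```python
# International Morse code for letters and digits.
MORSE = {
    "A": ".-", "B": "-...", "C": "-.-.", "D": "-..", "E": ".", "F": "..-.",
    "G": "--.", "H": "....", "I": "..", "J": ".---", "K": "-.-", "L": ".-..",
    "M": "--", "N": "-.", "O": "---", "P": ".--.", "Q": "--.-", "R": ".-.",
    "S": "...", "T": "-", "U": "..-", "V": "...-", "W": ".--", "X": "-..-",
    "Y": "-.--", "Z": "--..",
    "0": "-----", "1": ".----", "2": "..---", "3": "...--", "4": "....-",
    "5": ".....", "6": "-....", "7": "--...", "8": "---..", "9": "----.",
}

# Assumed bit patterns (hypothetical, not taken from the original program).
DOT = "10"         # short high, short low
DASH = "110"       # long high, short low
LETTER_GAP = "0"   # extra low after the last element of a letter
WORD_GAP = "00"    # further low for a space between words


def to_bemc(text):
    """Convert text to a string of '0'/'1'; unsupported characters are skipped."""
    bits = []
    for ch in text.upper():
        if ch == " ":
            bits.append(WORD_GAP)
        elif ch in MORSE:
            bits.extend(DOT if s == "." else DASH for s in MORSE[ch])
            bits.append(LETTER_GAP)
    return "".join(bits)


def bits_per_character(text):
    """Average number of bits per processed character under these patterns."""
    processed = sum(1 for ch in text.upper() if ch == " " or ch in MORSE)
    return len(to_bemc(text)) / processed if processed else 0.0
```

Every element ends in a low bit, so the listener sees a transition at least every three bits, which is the timing advantage described above.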

The average word length of typical writing is 4.8 characters¹². Add 1 character for spaces between words for 5.8 characters per word. An average text message is 7 words long¹³, or about 40 characters. At 5 bits per character, a text message takes 200 bits, or 80 seconds at 2.5 Hz and 20 seconds at 10 Hz.

Alternatively, this equates to 29 bits per word, 5.2 WPM at 2.5 Hz, or 21 WPM at 10 Hz.
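The throughput figures can be reproduced with a few lines of arithmetic. The constants come from the text; the function names are illustrative only.

```python
CHARS_PER_WORD = 4.8 + 1   # average word length plus one space
BITS_PER_CHAR = 5          # 5-bit, 32-character code


def words_per_minute(bit_rate_hz):
    """Words per minute at a given number of bits per second."""
    bits_per_word = CHARS_PER_WORD * BITS_PER_CHAR   # about 29 bits
    return bit_rate_hz * 60 / bits_per_word


def seconds_per_message(chars, bit_rate_hz):
    """Time needed to send a message of the given length in characters."""
    return chars * BITS_PER_CHAR / bit_rate_hz
```

For example, `words_per_minute(2.5)` is about 5.2 and `seconds_per_message(40, 2.5)` is 80 seconds.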

Using this in Practice

Obviously, all of this is impractical for normal purposes. There's a perfectly good way to communicate with the human vocal apparatus: speech.

But if you want to get short messages across, you could do so. The trick is shifting frequencies or amplitudes enough for the other person to consistently decode the data without errors, but not so much that other people hear the difference.

I doubt there's any way to prevent others from realizing your speech or singing is weird, but they wouldn't immediately know what you were doing. Further, they wouldn't necessarily know your exact encoding, though they could easily figure it out from a recording.

The biggest problem here is likely to be keeping reasonably consistent timing over minute-long durations, but I'd guess it's doable.

Synthetics or Cybernetic Implants

A synthetic person, or a person with cybernetic implants, could plausibly use these techniques to reach much higher data transfer rates. The maximum frequency achievable in air is about 5 GHz¹⁴, limiting us to about 2.5 Gb/s, which is 86 million words per second, 5 billion WPM, 12 million texts per second, or 81 nanoseconds per text.

But in all likelihood, you could do much better with non-acoustic data transfer methods if you had access to these levels of electronics.

References

_{⁰ An Electronic Notes article, What is QAM: quadrature amplitude modulation. https://www.electronics-notes.com/articles/radio/modulation/quadrature-amplitude-modulation-what-is-qam-basics.php
¹ Taken from Wikipedia under the Creative Commons license. https://en.wikipedia.org/wiki/File:Amfm3-en-de.gif
² Research study, Fast and slow twitch units in a human muscle from 1971. Found at the Journal of Neurology, Neurosurgery, and Psychiatry website. https://jnnp.bmj.com/content/jnnp/34/2/113.full.pdf
³ A Wikipedia article, On-Off keying. https://en.wikipedia.org/wiki/On%E2%80%93off_keying
⁴ A Wikipedia article, Frequency-shift keying. https://en.wikipedia.org/wiki/Frequency-shift_keying
⁵ Taken from Wikipedia under the Creative Commons license. https://commons.wikimedia.org/w/index.php?curid=635074
⁶ A modified version of (5), submitted under the original license.
⁷ Table of ASCII codes. http://www.asciitable.com/
⁸ Unicode Consortium's overview of Unicode. https://home.unicode.org/basic-info/overview/
⁹ babou's very awesome cs.stackexchange answer to Is Morse Code binary, ternary or quinary?. https://cs.stackexchange.com/a/39922
¹⁰ My C program hosted at the OnlineGDB C compiler. https://onlinegdb.com/BkMqmOvUS
¹¹ A Wikipedia article, Lake chub. https://en.wikipedia.org/wiki/Lake_chub
¹² A Peter@Norvig.com article, English Letter Frequency Counts: Mayzner Revisited or ETAOIN SRHLDCU. http://norvig.com/mayzner.html
¹³ A Crushh article, K, Wrap It Up Mom. https://crushhapp.com/blog/k-wrap-it-up-mom
¹⁴ Ron Maimon's physics.stackexchange answer to Is there an upper frequency limit to ultrasound?. https://physics.stackexchange.com/a/23427/90152}

Wow, thanks Michael! Doing such great and thorough work when the proverbial cake (the answer point bonus :) ) is already eaten is something I can admire! — user6415, Sep 12 at 8:59

score 4 · Accepted Answer · 2019-09-10 22:56:56Z

My first thought was that you can use principles of steganography here, but you've rejected the simpler patterns in the first bullet (there should not be any link to the hidden message within the words of Bob's utterance).

So, next option. Most people speak from the mouth, not nasally. You can try to have Bob speak regular words nasally. Choose any two types of words here, for example monosyllabic and disyllabic words. Based on this, you can now convert these sounds into Morse code for English characters.

EDIT:

I realise now that you are only looking for a way to encode the information in a uniform manner, in which case, even regular speech using the right monosyllable and disyllable words will suffice.

Sorry, I did not understand how the nasal voice would matter in this context? — user6415, Sep 10 at 22:36
@openend I think I had misunderstood your question the first time to mean using an acoustic property in addition to regular speech. So, I wanted to suggest using both nasal + regular voice for communication, with nasal being used to promote the scheme for Morse. But after re-reading your question, I understand now that you are only looking for a way to encode the information in a uniform manner, in which case, even regular speech using the right monosyllable and disyllable words will suffice. — Sep 10 at 22:43

JRodge01 · Accepted Answer · 2019-09-11 13:24:44Z

Many have covered encoding messages into speaking or writing, but I didn't see any covering singing.

There are many languages like Mandarin or Cantonese that use tonality to change the meaning of a word. Using tonality in a language that doesn't naturally ascribe meaning to it, especially during song, is a great way to hide messages.

For example, "forward" is a neutral tone, "turn right" is a rising tone, and "turn left" is a falling tone. A dipping tone (neutral, lower, back to neutral) could indicate down, while the opposite of a dipping tone could indicate up. Multiple words sung in a particular tone means "that direction for as many units as there are words". This schema allows encoding direction and magnitude into a song.

The phrase "when the bird flies high" sung in a rising tone indicates "turn right, then go forward 5 units". There can also be a pre-existing vocabulary for other important things, like mentioning a dog means "guards", describing how beautiful something is means "look out for something with that description". For more examples of this, you can read letters sent from WW2 POWs that have hidden messages encoded into rather innocuous statements.

There could be a physical signal or even another tonal encoding to indicate "these are instructions", such as a certain chord or part of a song, like the bridge leading into the final chorus.

The time has finally come - neutral, walk 5 feet forward

for us to strike - rising tone, turn right, walk 4 feet

our enemies down - dipping tone, 3 feet below where you are

The combination of key words, tonality, and word count can be used to discreetly convey hidden messages in song without tipping off casual listeners that there is a hidden message.
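As a rough sketch, the tone-plus-word-count scheme could be decoded like this. The tone labels are assumed to have been recognised by ear already, and "peaking" is a hypothetical name for the opposite of a dipping tone.

```python
# Tone contour -> instruction, as proposed in this answer.
TONE_MEANING = {
    "neutral": "forward",
    "rising": "turn right",
    "falling": "turn left",
    "dipping": "down",
    "peaking": "up",  # hypothetical label for "the opposite of a dipping tone"
}


def decode_line(lyric, tone):
    """Return (instruction, units): the tone picks the instruction,
    the number of words sung in that tone gives the distance."""
    return TONE_MEANING[tone], len(lyric.split())


song = [
    ("The time has finally come", "neutral"),
    ("for us to strike", "rising"),
    ("our enemies down", "dipping"),
]
instructions = [decode_line(lyric, tone) for lyric, tone in song]
```

For the three example lines this yields forward 5, turn right 4 and down 3, so the lyrics themselves never have to mention distances.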

Cool answer, but what if rising is right, dipping is below, and neutral is forward, how do I turn left or around, tell someone to look up? Or is every left turn sung as three rights, rising tone, in a row with 4 words each? Still not sure on up... Singing was my first thought as well. "Follow the drinking gourd" and other underground railroad references are prevalent here, even if they don't exactly match the requirements (bolded meaning of words clause) — Sep 12 at 22:23
You're asking questions I feel are already addressed by my answer. I already identify how to turn left and how to encode meanings to phrases. If you're proposing a language that doesn't have a "turn left" direction, then you're going to have to combine other existing instructions to create one. — Sep 13 at 11:25
You're right, I skimmed too much and thought I had the full picture/focused too much on the implementation of this case. I would still question the slight difference between a falling and dipping tone - I think it would make this near impossible to use in a real life application for reliable directions. Unless there are very very very clear delineations for the end of a line in the song(cus you can't see them) — Sep 13 at 16:36
but thanks for answering/my apologies on questioning what I hadn't read thoroughly — Sep 13 at 16:37

Yakk · Accepted Answer · 2019-09-11 19:07:20Z

Morse is what probably comes to mind first, but, as pointed out, it is incredibly low density.

If you can vary the pitch of the voice across 16 distinguishable levels, you can use it to encode hex digits, which is a reasonably denser way of encoding something than Morse.

Since you probably only need to encode 26 letters + 10 numbers, that gives us 36 values, each of which fits neatly into two hex digits:

Pitch - Hex
Pitch1 - 0
Pitch2 - 1
...
Pitch11 - A
...
Pitch16 - F

Letters A...Z - pitch combos from 01-1A

Numbers 0-9 - pitch combos from 1B-24

Actually there is so much surplus that you might wish to either reduce the alphabet to 16 values (to use a single hex digit per character) or use a higher-base encoding (if you can expect to distinguish between 36 voice pitches). For example you could reduce 0 to O, 1 to I, use U for V, I for J and Y, get rid of Q etc. Some of this was used in early typewriters, for example. It all depends on how smart you can assume the intended recipient to be in order to decode:

Walk5ftforwardturn9odegreesclockwisewalkanother4ftanddig3ftintoground.

Additionally, you could order the pitches so that the lowest and highest ones encode less frequently used values, to draw less attention to the pitch changes.

So now you have pitch modulated hex encoding, which could naturally be tied to syllables. Every 2 syllables encode one letter of your alphabet (including numbers).
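A small sketch of the two-syllable pitch code, under stated assumptions: letters are numbered 01 to 1A and digits 1B to 24 in hex (leaving 00 unused), and pitch level 1 stands for hex digit 0. The names are illustrative.

```python
import string

SYMBOLS = string.ascii_uppercase + string.digits   # A..Z, then 0..9


def to_pitch_pairs(message):
    """Map each supported character to two pitch levels in the range 1..16."""
    pairs = []
    for ch in message.upper():
        if ch not in SYMBOLS:
            continue                       # spaces and punctuation are dropped
        code = SYMBOLS.index(ch) + 1       # A -> 0x01 ... Z -> 0x1A, 0 -> 0x1B ... 9 -> 0x24
        high, low = divmod(code, 16)       # the two hex digits
        pairs.append((high + 1, low + 1))  # assumed: Pitch1 is hex 0
    return pairs


def from_pitch_pairs(pairs):
    """Inverse mapping: two pitch levels back to one character."""
    return "".join(SYMBOLS[(high - 1) * 16 + (low - 1) - 1] for high, low in pairs)
```

Only pitch levels 1 to 3 ever appear on the first syllable of a pair, which is the surplus the answer mentions.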

The spoken message itself is unimportant, as usually are the spaces between words.

aslum · Accepted Answer · 2019-09-11 15:03:27Z

Singing: Encoded in nonsense

There are a bunch of answers that cover spoken word. Many of them, while slightly noticeable in normal conversation/speech, would be extremely obvious when sung. There's no reason to not have two (or three if you have something else planned for screaming) different means of encoding your message.

Many songs have nonsense lyrics such as La La La or Na Na Na. For example The Police's De Do Do Do, De Da Da Da has the chorus:

De do do do de da da da Is all I want to say to you 
De do do do de da da da Their innocence will pull me through 
De do do do de da da da Is all I want to say to you 
De do do do de da da da They're meaningless and all that's true

Which could easily be altered to encode your message, either using Morse code and varying between DO and DA or with some other scheme. And in fact you'll probably want to try and find some other scheme, because Morse code uses between 1 & 4 bits per letter, and the Police song for example only has 32 bits per chorus (96 with three repeats), meaning you probably only have around 40-50 letters for your message. That said, you could probably find a song with more nonsense syllables in it (maybe Goldfrapp's Oh La La @ ~110).

All that said, you will probably want to shorten the message some. Just removing the spaces gives a 258-character message once converted to Morse code (190 if you discount spaces). You could shorten that considerably by not using proper full English. For example "Go N 5 f W 4 f Dig 3 f" is only 58 characters once converted to Morse code, which should easily fit in the Police song.

Remember that brevity is fairly important when encoding a message within another unless you're planning an hour-long speech or a full concert. Almost any kind of encoding is going to have a slower transmission speed than the medium using your requirements.

score 2 · Accepted Answer · 2019-09-11 17:15:42Z

Steganography has already been brought up, but not the best method for it. Modern steganography usually involves encoding binary data into the least significant values of a media file, which can be extracted by comparing a secret original file to the modified one. This creates such small variances that they are indistinguishable from normal noise, while also being 100% error proof.

How this applies to an audio file:

An audio file decompresses to a continuous string of values that can be represented by numbers. Let's say for this example that you are using an 8-bit depth. Your first clip might look like [122,143,201,203,198,152,100,84,...] and your second clip looks like [122,144,202,203,199,153,101,84,...]. To the human ear, these two tracks are indistinguishable, but by subtracting one from the other, you get the following binary [0,1,1,0,1,1,1,0] which is the letter "n" in ASCII.
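The subtraction in this example can be checked with a short sketch. It works on plain lists of sample values rather than a real audio file, and the function names are illustrative.

```python
# Sample values from the example: the unmodified clip and the carrier clip.
original = [122, 143, 201, 203, 198, 152, 100, 84]
modified = [122, 144, 202, 203, 199, 153, 101, 84]


def extract_bits(source, carrier):
    """Subtract the source from the carrier, sample by sample."""
    return [c - s for s, c in zip(source, carrier)]


def bits_to_text(bits):
    """Read the bits eight at a time, most significant bit first, as ASCII."""
    usable = len(bits) - len(bits) % 8     # ignore an incomplete final byte
    chars = []
    for i in range(0, usable, 8):
        byte = int("".join(str(b) for b in bits[i:i + 8]), 2)
        chars.append(chr(byte))
    return "".join(chars)


bits = extract_bits(original, modified)
letter = bits_to_text(bits)
```

The extracted bits are 0, 1, 1, 0, 1, 1, 1, 0, which is 0x6E, the ASCII code for "n".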

Most modern audio files have a 44,100 Hz sample rate with a 16-bit depth, meaning that you can pack your entire example message into an audio file by adding a 0.15% fluctuation to 0.02 seconds of the audio file. (Good luck hearing that!)

Normally, this type of steganography is impossible to decode without the source file, meaning you cannot use it ad hoc, thereby violating your second bullet point, but audio files have the unique quality of often being saved in stereo. It is very common for mono-track audio to be transmitted in stereo where both tracks are just the same track repeated. But in this case, you make one stereo track the source, and the other the encoded message so you can just extract one from the other. This is less secure than using a secret source file in that someone looking for an encoded message may be able to find it; so, you will need to make sure the message is encrypted before you encode it.

In this manner you can send a long and detailed message in any audio file (even just some random screaming). As long as the recipient has the right software and decryption key, they can read it.

Jay · Accepted Answer · 2019-09-11 21:04:59Z

I'm not sure exactly what you're trying to rule out with your first bullet. Are you just ruling out simply jumbling the words or adding a bunch of null words, i.e. you don't want someone to be able to see the clear words in the song? Or do you mean that the spoken words cannot be used in any way?

Because there are lots of ways to encode a message in a cover text. Sure, if you say "take every other word", someone who is expecting a message of this sort could see suspicious words in the song. I recall a code that was really used to pass a message to a captured spy once (I forget the circumstances, have to look it up) where the rule was, "take the 3rd letter after every punctuation mark". Punctuation wouldn't work well for a verbal message, of course, but you could have a rule like "take every fifth letter" or "take the second letter after every 'e'" or some such.

You could apply some formula to the text. Like take the number of letters in each word and run these through some sort of formula. Or one syllable words count as a dot and multi-syllable words as a dash and use Morse code. Etc etc.

Use a true code as opposed to a cypher. Make up a list of code words in advance. Like if he's going to sing a love song, say that "love" means "clockwise" and "beautiful" means "counter-clockwise" and "sunset" means "feet" and "dress" means "dig" and so on. Then he translates the message into the code words and fills in words that are not code words to make a coherent song.

If you mean that the words cannot be used in any way to convey the message:

You say both people can be assumed to be extremely perceptive. So okay, let's say the song covers 2 octaves. That's 16 notes. Any combination of 2 notes has 256 possibilities. Let 26 of those combinations map to the letters of the alphabet, maybe a few more map to other symbols you need, like spaces and essential punctuation. Then translate the message into these notes. A catch to this is that a random collection of notes wouldn't sound like a real song and might be impossible to sing. But if you're only using 30 or so of 256 note-pairs, you could use the other 226 freely as fillers to try to "smooth out" the tune. Or you could say that only every 4th note counts or some such. I haven't tried this and I don't know how practical it would be.

Similarly, code the message in the lengths of the notes. Say use binary: A whole note is 1, a half note is 2, a quarter note is 4, an eighth note is 8, a sixteenth note is 16. A rest marks the end of a group of notes making up a number. Then A=1, B=2, C=3, etc. Again, coding a message like this would make a highly discordant tune; you'd have to be able to add lots of nulls to smooth it out.
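A sketch of decoding the note-length code, assuming the weights run whole = 1, half = 2, quarter = 4, eighth = 8, sixteenth = 16. Treating any group whose total falls outside 1 to 26 as a null is also an assumption made here for illustration.

```python
# Assumed weights for each note length.
NOTE_VALUE = {"whole": 1, "half": 2, "quarter": 4, "eighth": 8, "sixteenth": 16}


def decode_notes(notes):
    """Sum the note values between rests; each total 1..26 maps to A..Z.

    Groups that sum to anything else are treated as nulls (an assumption).
    """
    letters = []
    total = 0
    for note in list(notes) + ["rest"]:   # a final rest flushes the last group
        if note == "rest":
            if 1 <= total <= 26:
                letters.append(chr(ord("A") + total - 1))
            total = 0
        else:
            total += NOTE_VALUE[note]
    return "".join(letters)


melody = ["quarter", "rest", "whole", "eighth", "rest", "half", "quarter", "whole"]
word = decode_notes(melody)
```

The three groups sum to 4, 9 and 7, so `word` is "DIG".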

Collett89 · Accepted Answer · 2019-09-12 10:19:49Z

Take any well-known tune ("Ode to Joy" came to mind for its simplicity).
Sing the tune with nonsense words - or perhaps alternate incorrect instructions - it is not the words that matter.

Treat the notes as binary - A 0 will be "in tune" and a 1 will be "off key".

Has the prerequisite that the encoder has reasonable singing ability (otherwise the decryption may come out all 1s).

An unfortunate side effect is that anyone familiar with the decryption method may read a lot of unintended nonsense from passing a karaoke bar.
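One way to make "in tune" versus "off key" precise is to measure each sung note against the reference melody in cents. The 30-cent tolerance and the function names below are assumptions for the sake of the sketch; the reference values are the equal-tempered frequencies of E4, E4, F4 and G4, the opening of "Ode to Joy".

```python
import math


def cents_off(sung_hz, target_hz):
    """Absolute distance between two pitches in cents (100 cents = 1 semitone)."""
    return abs(1200 * math.log2(sung_hz / target_hz))


def decode_bits(sung, reference, tolerance_cents=30):
    """0 for a note sung in tune, 1 for a note sung off key.

    The 30-cent tolerance is an assumed threshold, not part of the answer.
    """
    return [
        1 if cents_off(s, r) > tolerance_cents else 0
        for s, r in zip(sung, reference)
    ]


reference = [329.63, 329.63, 349.23, 392.00]   # E4, E4, F4, G4
sung = [329.63, 340.00, 349.23, 380.00]        # second and fourth notes are off
bits = decode_bits(sung, reference)
```

The second and fourth notes are each more than 50 cents away from the melody, so `bits` is `[0, 1, 0, 1]`.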
