How do we know for sure a transliteration is lossless?









4


















The Wikipedia article on Wylie transliteration says that the scheme is lossless.



ག ga
ང nga
ཉ nya
ན na


What if you had sequences like ནག (ng, or is it naga)? Is it lossless because we can guarantee that every consonant (or consonant bundle as in some of the letters) is separated by a vowel? (I don't really know Tibetan yet, so please excuse my ignorance).



IAST for Sanskrit is another lossless one.



So we have:



त t
ह h
थ th


There are many more that have this (seemingly) same problem: you can write the same thing multiple ways.



तह th
थ th


So if you had this in the original Sanskrit:



थथथथथथथथ


You would maybe transliterate it as:



thththththththth


Then you might do this to go back:



तहतहतहतहतहतहतहतह


Or any of these combos:



थतहतहतहतहतहतहथ
थतहतहतहतहतहथथ
...


What you end up with is not necessarily what you started with. How do they say this is lossless? Does it have the same property that every consonant/letter is separated by a vowel?
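To make the worry concrete, here is a minimal sketch that counts how many ways a romanized string can be split back into letters. It takes the simplified premise used above, that त, ह and थ map to bare `t`, `h` and `th` with nothing in between. The function name `count_parses` and the token set are illustrative assumptions, not part of any real transliteration library.

```python
def count_parses(text, tokens=("t", "h", "th")):
    """Count the ways `text` can be segmented into the given tokens.

    ways[i] holds the number of segmentations of text[:i].
    """
    ways = [0] * (len(text) + 1)
    ways[0] = 1  # the empty prefix has exactly one (empty) segmentation
    for end in range(1, len(text) + 1):
        for tok in tokens:
            start = end - len(tok)
            if start >= 0 and text[start:end] == tok:
                ways[end] += ways[start]
    return ways[-1]


# Each "th" can be read as one token or as "t" followed by "h".
print(count_parses("th"))      # 2
print(count_parses("th" * 8))  # 256
```

Under that premise the eight-letter example has 256 possible readings, which is exactly the ambiguity described here. Whether a real scheme has this problem depends on whether the premise holds for the script in question.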



Is there never a "t + h" sequence (t followed by a standalone "h") in Sanskrit, as opposed to a "th" sound (aspirated t)? What if we say there isn't, but later discover one? This is where I'm lost: it seems that such systems aren't really lossless.



Can someone explain how these are actually lossless? How can you show that a system is lossless, maybe not with a full mathematical proof, but with a thought experiment or something similar?



It would also be nice to know which languages have lossless transliterations available; I would like to check them out :)
































  • 1





    If there were no consonant clusters it would be simple, but I'm seeing pages say that Sanskrit does have some clusters, so a lossless transcription would be tricky. Maybe the clusters you identified just don't occur?

    – curiousdannii
    Sep 28 at 13:35


















cross-linguistic transliteration














asked Sep 28 at 11:03 by Lance Pollard (edited Sep 28 at 13:52)




















3 Answers


















9



















A transliteration system is usually designed either to be lossless or not. To tell which, you have to know the language being transliterated.



Lossless transliteration systems generally have to use one of four methods to stay unambiguous:



  • Don't use digraphs at all. Write every phoneme with a single character: ŋ instead of ng, x instead of kh, þ instead of th, etc.

  • Use a letter for digraphs that never appears elsewhere. Some transliteration systems for Russian reserve h for digraph use, so that kh, ch, sh are unambiguous: there's no such thing as an h on its own.

  • Use digraphs that are illegal consonant clusters in the language. Ancient Greek (Attic dialect at least) had all of /t/, /h/, and /tʰ/—but transcribing them t, h, th is unambiguous, since /th/ can never occur (depending on your analysis of words like μέθοδος).

  • Add a special way to disambiguate between the two. Swahili has both /ⁿg/ and /ŋ/; the former is written ng, the latter ng'. The "library transliteration" of Arabic uses th for /θ/, and t'h for /th/.

The first is sometimes considered cleanest, but tends to very quickly exceed the limits of ASCII.



The second works well for certain digraphs, less well for others: a language with /ŋ/ probably also has /n/ and /g/ already.



The third works great until you discover that your assumptions about what's illegal were wrong! This happened famously in Inuktitut: the orthography was designed on the assumption that the sequence /nŋ/ was illegal, so it uses the equivalent of nng for a geminate /ŋŋ/. It then turned out that some lesser-known dialects do have /nŋ/, and an awkward solution had to be retrofitted. Oops.



The fourth is the easiest to retrofit onto an existing system, and is fairly widespread. If you aren't already using the apostrophe for something, it's an easy way to fix pretty much any ambiguities that come up.
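As a thought experiment for the fourth method, the sketch below brute-forces a toy inventory of three phonemes (`T`, `H`, `TH`) and checks whether any two phoneme sequences end up with the same spelling. The inventory, the function names, and the rule "insert an apostrophe when T is directly followed by H" are all assumptions made for illustration. A bounded search like this is evidence for losslessness, not a proof of it.

```python
from itertools import product

# Hypothetical toy inventory: three phonemes, one written as a digraph.
ROMAN = {"T": "t", "H": "h", "TH": "th"}


def encode_naive(phonemes):
    """Concatenate the romanizations with no disambiguation."""
    return "".join(ROMAN[p] for p in phonemes)


def encode_marked(phonemes):
    """Same, but put an apostrophe between a T and a directly following H."""
    out = []
    for i, p in enumerate(phonemes):
        if p == "H" and i > 0 and phonemes[i - 1] == "T":
            out.append("'")
        out.append(ROMAN[p])
    return "".join(out)


def is_lossless(encode, max_len=4):
    """Return (True, None) if no two sequences up to max_len share a spelling,
    otherwise (False, (first_sequence, clashing_sequence))."""
    seen = {}
    for n in range(1, max_len + 1):
        for seq in product(ROMAN, repeat=n):
            text = encode(seq)
            if text in seen and seen[text] != seq:
                return False, (seen[text], seq)
            seen[text] = seq
    return True, None


print(is_lossless(encode_naive))   # (False, (('TH',), ('T', 'H')))
print(is_lossless(encode_marked))  # (True, None)
```

The naive encoding fails at the first opportunity, because the single phoneme `TH` and the sequence `T`, `H` are both spelled `th`. With the apostrophe rule the second is spelled `t'h` instead, and no clash turns up within the searched lengths.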




























  • 1





    #3 is exactly what I was thinking would be the problem of every transliteration. How can you guarantee this is the case for Sanskrit and Tibetan?

    – Lance Pollard
    Sep 28 at 17:24






  • 1





    @LancePollard Be a fluent/native speaker, ideally. The problem with Inuktitut is that they wanted a system that would work for every dialect, but didn't realize a couple outlying dialects allowed this particular consonant cluster—the people working on it were fluent, and knew well that there was no such cluster in Nunavut.

    – Draconis
    Sep 28 at 17:25











  • @LancePollard in the case of languages with extensive written corpora (IMHO Sanskrit might qualify) and sizeable dictionaries you can test and verify such assumptions with simple string search/regular expressions.

    – Peteris
    Sep 29 at 7:41
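A minimal sketch of that kind of check, assuming the corpus is available as a plain list of Devanagari words. The pattern looks for TA followed by VIRAMA followed by HA, which is how a bare t + h cluster would be written. The sample word list and the name `find_cluster` are made up for illustration; a real test would run over an actual dictionary or corpus.

```python
import re

# Devanagari TA (U+0924) + VIRAMA (U+094D) + HA (U+0939):
# the written form a bare "t + h" cluster would take.
CLUSTER = re.compile("\u0924\u094d\u0939")


def find_cluster(words):
    """Return the words that contain the TA + VIRAMA + HA sequence."""
    return [w for w in words if CLUSTER.search(w)]


# Hypothetical sample, not real corpus data.
sample = ["\u0924\u0939", "\u0925", "\u0924\u094d\u0939"]
print(find_cluster(sample))
```

An empty result over a large corpus supports the assumption that the cluster does not occur, but it cannot rule out forms the corpus happens not to contain.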






  • 1





    If anyone's interested, the "awkward solution" @Draconis refers to is described here: tusaalanga.ca/node/2520. Essentially they replaced the digraph with <ŋ> so that we have <nŋ> and <ŋŋ> to distinguish the two cases.

    – jogloran
    Sep 29 at 21:13







  • 2





    @jogloran Indeed! It's awkward mostly because it breaks compatibility with the other dialects. And, more controversially, they decided to cease the official use of syllabics in those dialects until they could find a workaround—which breaks compatibility even worse. Eventually I'm assuming the "ng"-ligature used in standard syllabics will get replaced with a new "ŋ" character, and maybe the awkward r/q problems will get fixed too—but it won't be a quick, or easy, change at this point.

    – Draconis
    Sep 29 at 21:17


















2



















Regarding Wylie, the problem you describe is part of the issue noted in the Wikipedia article you linked:




Wylie's original scheme is not capable of transliterating all Tibetan-script texts. In particular, it has no correspondences for most Tibetan punctuation symbols, and lacks the ability to represent non-Tibetan words written in Tibetan script...




I believe the EWTS variant addresses most of the ambiguities, but transliterating to/from Wylie inherently requires knowledge of Tibetan orthographic rules for identification of root letter and which stacks, prefixes, and first- and second-suffixes are valid. For at least a few three-letter words with no vowel mark, the rule is essentially just a special-case for the particular word.



So indeed, this kind of transliteration system is very limited and suffers from the problems you expected. Other transliteration systems like Romaji (at least as I understand it) don't; the difference is ability to preserve character boundaries unambiguously.



Arguably, such transliteration systems are obsolete in the age of Unicode anyway.






































    1



















    Sanskrit is lossless. तह is romanized taha, and there is no cluster t+h distinct from the aspirated consonant थ, which is romanized th. In your spelling of bare "t" you omitted the virama: it should be त्.
    You can't later discover a "t+h" cluster in Sanskrit, because there isn't one. You could still wonder how Sanskrit grammarians doing fieldwork on Arabic would render the cluster "th"; maybe they would write त्ह. Problems may also arise in transcribing grammarian metalanguage, which massively violates the rules of Sanskrit. Since I guess you didn't know that there is no t+h cluster in Sanskrit, this bears on how you'd know whether a system is lossless: you have to know the target language, and compare the facts of the language to what you know about spelling. I conjecture that North Saami is lossless w.r.t. pronunciation of written words, up to the point of social indeterminacy (are Norwegian u and y adopted into the language with the same vowel or different vowels?).
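The role of the inherent vowel and the virama can be sketched in a few lines. This is a toy romanizer covering only the three consonants discussed here, not an implementation of IAST. The table `CONSONANTS` and the function `to_roman` are assumptions made for illustration.

```python
VIRAMA = "\u094d"  # Devanagari sign that suppresses the inherent vowel

# Toy subset of the consonant table (three letters only).
CONSONANTS = {
    "\u0924": "t",   # TA
    "\u0939": "h",   # HA
    "\u0925": "th",  # THA
}


def to_roman(text):
    """Romanize a string made only of the consonants above and virama."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch not in CONSONANTS:
            raise ValueError(f"unsupported character: {ch!r}")
        out.append(CONSONANTS[ch])
        if i + 1 < len(text) and text[i + 1] == VIRAMA:
            i += 2           # virama: bare consonant, no vowel
        else:
            out.append("a")  # inherent vowel
            i += 1
    return "".join(out)


print(to_roman("\u0924\u0939"))        # taha
print(to_roman("\u0925"))              # tha
print(to_roman("\u0924\u094d"))        # t
print(to_roman("\u0924\u094d\u0939"))  # tha
```

In this toy mapping, TA + HA comes out as `taha` and cannot be confused with THA (`tha`). The only input that clashes with `tha` is TA + VIRAMA + HA, so the mapping is lossless exactly as long as that cluster never occurs in the language.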
































The first answer above (the one describing four methods) was answered Sep 28 at 16:53 by Draconis (edited Sep 28 at 17:47).










      • 1
        #3 is exactly what I was thinking would be the problem of every transliteration. How can you guarantee this is the case for Sanskrit and Tibetan?
        – Lance Pollard
        Sep 28 at 17:24

      • 1
        @LancePollard Be a fluent/native speaker, ideally. The problem with Inuktitut is that they wanted a system that would work for every dialect, but didn't realize a couple of outlying dialects allowed this particular consonant cluster. The people working on it were fluent, and knew well that there was no such cluster in Nunavut.
        – Draconis
        Sep 28 at 17:25

      • @LancePollard In the case of languages with extensive written corpora (IMHO Sanskrit might qualify) and sizeable dictionaries, you can test and verify such assumptions with simple string search/regular expressions.
        – Peteris
        Sep 29 at 7:41

      • 1
        If anyone's interested, the "awkward solution" @Draconis refers to is described here: tusaalanga.ca/node/2520. Essentially they replaced the digraph with <ŋ> so that we have <nŋ> and <ŋŋ> to distinguish the two cases.
        – jogloran
        Sep 29 at 21:13

      • 2
        @jogloran Indeed! It's awkward mostly because it breaks compatibility with the other dialects. And, more controversially, they decided to cease the official use of syllabics in those dialects until they could find a workaround, which breaks compatibility even worse. Eventually I'm assuming the "ng"-ligature used in standard syllabics will get replaced with a new "ŋ" character, and maybe the awkward r/q problems will get fixed too, but it won't be a quick or easy change at this point.
        – Draconis
        Sep 29 at 21:17
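
Peteris's suggestion above, checking an assumed-illegal sequence against a word list, might look roughly like this. It is only a sketch: the function name `find_counterexamples` is hypothetical, and the strings in `sample` are made-up placeholders rather than attested words.

```python
import re

def find_counterexamples(words, forbidden):
    """Return the words that contain a sequence the orthography assumes never occurs."""
    pattern = re.compile(forbidden)
    return [w for w in words if pattern.search(w)]

# Placeholder strings, not real vocabulary.
sample = ["aŋŋa", "ana", "anŋa", "taha"]

print(find_counterexamples(sample, "nŋ"))  # ['anŋa']
```

An empty result only says the sequence is absent from that particular corpus. As the Inuktitut case shows, a corpus that leaves out the outlying dialects would not have caught the problem.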

























      2



















      Regarding Wylie, the problem you describe is part of the issue noted in the Wikipedia article you linked:




      Wylie's original scheme is not capable of transliterating all Tibetan-script texts. In particular, it has no correspondences for most Tibetan punctuation symbols, and lacks the ability to represent non-Tibetan words written in Tibetan script...




      I believe the EWTS variant addresses most of the ambiguities, but transliterating to or from Wylie inherently requires knowledge of Tibetan orthographic rules, both to identify the root letter and to know which stacks, prefixes, and first and second suffixes are valid. For at least a few three-letter words with no vowel mark, the rule is essentially just a special case for the particular word.

      So indeed, this kind of transliteration system is very limited and suffers from the problems you expected. Other transliteration systems like Romaji (at least as I understand it) don't; the difference is the ability to preserve character boundaries unambiguously.



      Arguably, such transliteration systems are obsolete in the age of Unicode anyway.
















          answered Sep 29 at 3:52

          R..
























              1



















              Sanskrit romanization is lossless. तह is romanized taha, and there is no cluster th distinct from the aspirated consonant romanized as th, spelled थ. You omitted the virama in your spelling of bare "t", i.e. त्.
              You can't later discover that there is "t+h" in Sanskrit, because there isn't, though you could wonder how Sanskrit grammarians doing fieldwork on Arabic would render the cluster "th". Maybe they would write त्ह. It is possible that problems arise in transcribing grammarian metalanguage, which massively violates the rules of Sanskrit. That you apparently didn't know there is no t+h cluster in Sanskrit bears on how you'd know whether a system is lossless: you have to know the target language, and compare the facts of the language to what you know about its spelling. I conjecture that North Saami is lossless w.r.t. pronunciation of written words, up to the point of social indeterminacy (are Norwegian u and y adopted into the language with the same vowel or different vowels?).
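
To make the point concrete, here is a toy romanizer covering just three consonants and the virama. It is a sketch, not a full IAST implementation, and the names `CONSONANTS`, `VIRAMA` and `romanize` are hypothetical. It shows that the hypothetical sequence त्ह would collide with थ, so the scheme stays lossless only because that cluster does not occur.

```python
VIRAMA = "\u094d"  # ् suppresses the inherent vowel of the preceding consonant

# Only the three consonants discussed above: त, थ, ह.
CONSONANTS = {"\u0924": "t", "\u0925": "th", "\u0939": "h"}

def romanize(text):
    """Romanize a string made of the supported consonants and the virama."""
    out = []
    for ch in text:
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch] + "a")  # every consonant carries inherent "a"
        elif ch == VIRAMA and out:
            out[-1] = out[-1][:-1]            # virama removes that inherent "a"
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return "".join(out)

print(romanize("\u0924\u0939"))        # तह  -> taha
print(romanize("\u0925"))              # थ   -> tha
print(romanize("\u0924\u094d\u0939"))  # त्ह -> tha (same output as थ)
```

The last two inputs differ but produce the same romanization, which is exactly the ambiguity that would exist if t+h were a possible cluster.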
















                  answered Sep 28 at 16:06

                  user6726






























