U &ºch!ã8@sîdZddlZddlZddlZddlZddlmZe e ¡Z dddœZdd idd idœZddiZ dd ddddddddddddddddddd d!d"d#d$d%d&d'd(d)d*d+d,d-d.d/d0d1d2d3d4d5d6d7d8d9d:d;dd?d@dAdBdCœ7ZdDdE„ZGdFdG„dGeƒZdS)Hz)Tokenization classes for Salesforce CTRL.éNé)ÚPreTrainedTokenizerz vocab.jsonz merges.txt)Ú vocab_fileÚmerges_fileZctrlzHhttps://raw.githubusercontent.com/salesforce/ctrl/master/ctrl-vocab.jsonzHhttps://raw.githubusercontent.com/salesforce/ctrl/master/ctrl-merges.txtéiµ’iûi·ŸiÐ÷i»öi#jiñviµ~i6²iÅÁivÌiòiØ.iïiè½i×šiÍ¨i§¯i%æi¦iøi3iR-iniS.iKiñiwÌiÁ´i[i*i¡“iœìiÚ/iè?iñíin1iipi€i„iòÉiÏ’i i)i-‘iœ(iºøi™KiîÕiŒiÇ¢i iÄhi–õ)7Z PregnancyZChristianityZExplainZFitnessZSavingZAskZAssZJokeZ QuestionsZThoughtsZRetailZFeminismZWritingZAtheismZNetflixZ ComputingZOpinionZAloneÚFunnyZGamingZHumanZIndiaZJokerZDietZLegalZNormanZTipZWeightZMoviesZRunningZScienceZHorrorZ ConfessionZFinanceZPoliticsZScaryZSupportZTechnologiesZTeenageÚEventZLearnedZNotionZ WikipediaZBooksZExtractZConfessionsZ ConspiracyZLinksZ NarcissusZRelationshipZ RelationshipsZReviewsZNewsZTranslationZmultilingualcCs>tƒ}|d}|dd…D]}| ||f¡|}qt|ƒ}|S)z€Return set of symbol pairs in a word. Word is represented as tuple of symbols (symbols being variable-length strings). rrN)ÚsetÚadd)ÚwordÚpairsZ prev_charÚchar©rúB/tmp/pip-unpacked-wheel-ymerj3tt/transformers/tokenization_ctrl.pyÚ get_pairsfsrcsveZdZdZeZeZeZ e Zd‡fdd„ Ze dd„ƒZdd„Zd d „Zdd„Zd d„Zdd„Zdd„Zdd„Z‡ZS)Ú CTRLTokenizera„ Constructs a CTRL tokenizer. Peculiarities: - Byte-Pair-Encoding This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users should refer to the superclass for more information regarding methods. Args: vocab_file (:obj:`str`): Path to the vocabulary file. merges_file (:obj:`str`): Path to the merges file. unk_token (:obj:`string`, `optional`, defaults to ""): The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. úc s®tƒjfd|i|—Žt|dd}t |¡|_W5QRXdd„|j ¡Dƒ|_t|dd}| ¡ d¡dd…}W5QRXd d „|Dƒ}t t|tt |ƒƒƒƒ|_i|_dS)NÚ unk_tokenúutf-8©ÚencodingcSsi|]\}}||“qSrr)Ú.0ÚkÚvrrrÚ ’sz*CTRLTokenizer.__init__..Ú réÿÿÿÿcSsg|]}t| ¡ƒ‘qSr)ÚtupleÚsplit)rÚmergerrrÚ •sz*CTRLTokenizer.__init__..)ÚsuperÚ__init__ÚopenÚjsonÚloadÚencoderÚitemsÚdecoderÚreadrÚdictÚzipÚrangeÚlenÚ bpe_ranksÚcache)ÚselfrrrÚkwargsZvocab_handleZ merges_handleZmerges©Ú __class__rrr"s zCTRLTokenizer.__init__cCs t|jƒS©N)r-r&©r0rrrÚ vocab_size™szCTRLTokenizer.vocab_sizecCst|jf|jŽSr4)r*r&Zadded_tokens_encoderr5rrrÚ get_vocabszCTRLTokenizer.get_vocabc s’|ˆjkrˆj|St|ƒ}tt|dd…ƒ|ddgƒ}t|ƒ}|sN|St|‡fdd„d}|ˆjkrpqn|\}}g}d}|t|ƒkrDz| ||¡} Wn,tk rÊ| ||d…¡YqDYnX| ||| …¡| }|||kr,|t|ƒdkr,||d|kr,| ||¡|d7}q€| ||¡|d7}q€t|ƒ}|}t|ƒdkrdqnqNt|ƒ}qNd |¡}|dd …}|ˆj|<|S)Nrzcsˆj |tdƒ¡S)NÚinf)r.ÚgetÚfloat)Úpairr5rrÚ«óz#CTRLTokenizer.bpe..©Úkeyrréú@@ éüÿÿÿ)r/rÚlistrÚminr.r-ÚindexÚ ValueErrorÚextendÚappendÚjoin) r0ÚtokenrrZbigramÚfirstÚsecondZnew_wordÚiÚjrr5rÚbpe sF " 2 zCTRLTokenizer.bpecCs>g}t d|¡}|D]$}| dd„| |¡ d¡Dƒ¡q|S)z Tokenize a string. z\S+\n?cSsg|]}|‘qSrr)rÚtrrrr Ôsz+CTRLTokenizer._tokenize..ú )ÚreÚfindallrGrOr)r0ÚtextZsplit_tokensÚwordsrJrrrÚ _tokenizeÌs "zCTRLTokenizer._tokenizecCs|j ||j |j¡¡S)z2 Converts a token (str) in an id using the vocab. )r&r9r)r0rJrrrÚ_convert_token_to_id×sz"CTRLTokenizer._convert_token_to_idcCs|j ||j¡S)z=Converts an index (integer) in a token (str) using the vocab.)r(r9r)r0rErrrÚ_convert_id_to_tokenÛsz"CTRLTokenizer._convert_id_to_tokencCsd |¡ dd¡ ¡}|S)z< Converts a sequence of tokens (string) in a single string. rQrAÚ)rIÚreplaceÚstrip)r0ÚtokensZ out_stringrrrÚconvert_tokens_to_stringßsz&CTRLTokenizer.convert_tokens_to_stringc Csütj |¡s t d |¡¡dStj |td¡}tj |td¡}t|ddd}| t j|jdd ¡W5QRXd }t|dddh}| d¡t |j ¡dd „dD]@\}}||krÌt d |¡¡|}| d |¡d¡|d7}q¨W5QRX||fS)a Save the vocabulary and special tokens file to a directory. Args: save_directory (:obj:`str`): The directory in which to save the vocabulary. Returns: :obj:`Tuple(str)`: Paths to the files saved. z*Vocabulary path ({}) should be a directoryNrrÚwrrF)Úensure_asciirz#version: 0.2 cSs|dS)Nrr)Úkvrrrr<ûr=z/CTRLTokenizer.save_vocabulary..r>zqSaving vocabulary to {}: BPE merge indices are not consecutive. Please check that the tokenizer is not corrupted!rQrr)ÚosÚpathÚisdirÚloggerÚerrorÚformatrIÚVOCAB_FILES_NAMESr#Úwriter$Údumpsr&Úsortedr.r'Úwarning) r0Zsave_directoryrZ merge_fileÚfrEÚwriterZ bpe_tokensZtoken_indexrrrÚsave_vocabularyäs* ÿÿzCTRLTokenizer.save_vocabulary)r)Ú__name__Ú __module__Ú__qualname__Ú__doc__rgZvocab_files_namesÚPRETRAINED_VOCAB_FILES_MAPZpretrained_vocab_files_mapÚ&PRETRAINED_POSITIONAL_EMBEDDINGS_SIZESZmax_model_input_sizesÚ CONTROL_CODESZ control_codesr"Úpropertyr6r7rOrVrWrXr]rnÚ __classcell__rrr2rrus ,r)rrr$ÚloggingraÚregexrRZtokenization_utilsrÚ getLoggerrordrgrsrtrurrrrrrÚs’ þþÿÉ;