Friday, 14 August 2020

Python: How to randomize numbers within strings where the incoming format is not known?

For an NLP project I need to generate randomized number strings for training purposes, based on training examples. The numbers come as strings (from OCR). Let me restrict the problem statement here to percentage values, where the formats observed so far include the following, or any meaningful combination of the format features pointed out:

'60'       # no percentage sign, precision 0, no other characters
'60.00'    # no percentage sign, precision 2, dot for digit separation
'60,000'   # no percentage sign, precision 3, comma for digit separation
'60.0000'  # no percentage sign, precision 4, dot for digit separation
'60.00%'   # precision 2, dot for digit separation, with percentage sign
'60.00 %'  # same as above, with whitespace
'100%'     # three digits, zero precision, percentage sign
'5'        # single digit
'% 60'     # percentage sign in front of the number, whitespace

My goal is to randomize the number while preserving the per-character format (exception: the number of digits may change, since a 5.6 could be randomized to 18.7 or 100.0 and vice versa). The percentage value should lie between 0 and 100. A few examples of what I need:

input  = '5'  # integer-like digit
output = [  '7', 
           '18', 
          '100'] 

input  =  '100.00 %' # 2-precision float with whitespace & percentage sign
output = [  '5.38 %', 
           '38.05 %', 
          '100.00 %']  

input  =  '% 60,0000' # percentage sign, whitespace, 4-precision float, comma separator
output = ['% 5,5348', 
          '% 48,7849', 
          '% 100,0000'] 

How could I do this? The solution can be either conceptual or a code example. It needs to handle the formats that may appear in the real data.

The best I know so far is to brute-force handwrite if-clauses for every format variation I can come up with.
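
One less brute-force direction, as a minimal sketch only: split each string with a single regular expression into "text before", integer digits, an optional separator with fraction digits, and "text after", then rebuild it around a random value. The names `NUMBER_RE` and `randomize_percentage` are made up for illustration, and the sketch assumes exactly one number per string with no thousands separators; strings that do not fit are returned unchanged.

```python
import random
import re

# Hypothetical pattern: optional non-digit text, integer digits, an optional
# separator (dot or comma) followed by fraction digits, optional non-digit text.
NUMBER_RE = re.compile(
    r"^(?P<prefix>\D*)(?P<int>\d+)(?:(?P<sep>[.,])(?P<frac>\d+))?(?P<suffix>\D*)$"
)


def randomize_percentage(text, rng=random):
    """Return text with its number replaced by a random value in [0, 100]."""
    match = NUMBER_RE.match(text)
    if match is None:
        return text  # assumption: formats the pattern does not cover stay as-is

    frac = match.group("frac")
    precision = len(frac) if frac else 0
    scale = 10 ** precision

    # Draw an integer count of the smallest unit so that exactly 100 is reachable
    # and no floating-point rounding is involved.
    units = rng.randint(0, 100 * scale)
    whole, part = divmod(units, scale)

    number = str(whole)
    if precision:
        # Reuse the original separator and pad the fraction to the same length.
        number += match.group("sep") + str(part).zfill(precision)
    return match.group("prefix") + number + match.group("suffix")
```

With this, `randomize_percentage('% 60,0000')` would return a string shaped like `'% 48,7849'`, and `randomize_percentage('5')` an integer-like string between `'0'` and `'100'`. Passing a seeded `random.Random` instance as `rng` makes the output reproducible.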



