Instructions
click on instructions link in search bar to access these instructions at any time
click on red text to expand and collapse information
General Information about arabiCorpus
- Searches large, untagged arabic corpora for words you type in.
- Searches for EXACTLY the string you type in (and nothing else).
- Corpora can be searched singly or in various combinations.
- Some filtering of results available.
- Easy to search for individual words.
- Ability to search for multiple words at once.
- Accepts most of regular expression language.
- Use the Tutorial below to help you with your initial searches.
Searching arabiCorpus
- Basic Searching Instructions
- Type a search word or phrase into one of the two boxes on the left. Type it WITHOUT vowels and without prefixes like the definite article.
- EITHER: type transliteration into the 'latin chars' box, using the DT transliteration system (click 'transliteration help' to see chart).
- OR, type Arabic script into the 'arabic chars' box (your computer must be set up to type Arabic script).
- Choose a part of speech filter. If you don't want to use a filter, choose 'string'.
- Choose a corpus to search or one of the corpus combinations.
- Click on 'submit'.
- If you haven't chosen a corpus or a part of speech filter, a little red message will warn you to do so, and no search will be performed.
- Detailed Searching Instructions
- Search Word
- Type word, phrase or regular expression into one of the two leftmost boxes, not both.
- Click on 'transliteration help' to see transliteration chart; note there is a one to one correspondence between Arabic and transliterated letters.
- To type Arabic, make sure your computer is set up to type Arabic, and use the regular computer input method.
- A space means a space. If you type two or more words it will look for the exact phrase.
- The program automatically collapses initial hamzas, meaning that typing L or A or E will end up finding all of them.
- To find all the examples of what you are looking for, you must be aware of Arabic spelling variations. For example:
- Final yaa' varies with final yaa' without the two dots: لبناني لبنانى
- So, if you search for يلي you will find only that, and no occurrences of يلى, and vice versa.
- If you want to find both يلي and يلى at the same time, you have to build it into your search (see Advanced Search Capabilities below).
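- The one-to-one correspondence between transliterated and Arabic letters can be pictured as a simple lookup table. The sketch below is only illustrative: the table is partial and was inferred from the transliterated/Arabic pairs that appear in these instructions, and the names `DT_TO_ARABIC`, `to_arabic` and `to_dt` are hypothetical. The full chart is the one shown under 'transliteration help'.

```python
# Partial letter map, inferred only from the example pairs in these
# instructions (an assumption, not the site's complete chart).
DT_TO_ARABIC = {
    "A": "ا", "L": "أ", "b": "ب", "t": "ت", "V": "ث", "j": "ج", "H": "ح",
    "x": "خ", "r": "ر", "s": "س", "T": "ط", "c": "ع", "g": "غ", "f": "ف",
    "q": "ق", "k": "ك", "l": "ل", "m": "م", "n": "ن", "h": "ه", "w": "و",
    "y": "ي", "e": "ى", "Q": "ة", "C": "ء",
}
# Because the mapping is one to one, it can simply be inverted.
ARABIC_TO_DT = {arabic: latin for latin, arabic in DT_TO_ARABIC.items()}


def to_arabic(text):
    """Convert transliteration to Arabic script, leaving unknown characters alone."""
    return "".join(DT_TO_ARABIC.get(ch, ch) for ch in text)


def to_dt(text):
    """Convert Arabic script to transliteration, leaving unknown characters alone."""
    return "".join(ARABIC_TO_DT.get(ch, ch) for ch in text)


print(to_arabic("AlmktbQ"))
print(to_dt("مقهى"))
```

- Because each letter maps to exactly one letter, converting to Arabic and back returns the original string for any letter in the table.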
- Choose Corpus
- The newspapers in the list currently include one year of Al-Ahram (1999-Egypt), two years of Al-Hayat in separate corpora (1996, 1997-London-Saudi-Lebanon), and a half year each of At-Tajdid (2002-Morocco) and Al-Watan (2002-Kuwait), as well as approximately a year of Al-Thawra (Syria, after 2000) and the Penn Treebank news data (which is not included in the All Newspapers category). The premodern corpus includes the Quran, 1001 Nights, and some works of medieval philosophy and medicine (Al-Ghazali's Incoherence of the Philosophers, Averroes' Middle Commentary, and medical works by Maimonides including On Asthma and Medical Aphorisms). The modern Literature corpus includes a number of Arabic novels (see list below), a couple of plays and some short stories. The non-fiction category is currently small, and includes some literary criticism, other scholarly (and not so scholarly) works, some political speeches, and some official UN and other diplomatic documents. The Egyptian Arabic category is also small and includes a couple of literary works that include a lot of colloquial, and some material from the Egypt Chat web site.
- The combined categories are relatively self-explanatory. Most of the smaller works (novels, plays, non-fiction, premodern except the Quran and 1001 Nights) are not individually searchable, but only as a group. Some works are in more than one group; for example, a colloquial play is in both Modern Literature and Egyptian Arabic, and all the novels are in both the novel and the Modern Literature category.
- Here are the novels currently in the corpus. The countries represented are Egypt, Palestine, Algeria, Saudi Arabia, the Sudan, Syria, Lebanon. Currently about half of the material in the Modern Literature corpus is from Egypt, about a fourth from Algeria, and lesser amounts from the remainder. Please note that we are currently (Oct 2007) adding works to the corpus on a slow but regular basis, and that this list gets updated less often than works are added, so there may be more than is listed here when you do your search, and the numbers listed below could also be low for the same reason.
- ريم بسيوني: رائحة البحر
- علي سالم: أولادنا في لندن
- إبراهيم عبد المجيد: لا أحد ينام في الاسكندرية
- علاء الأسواني: عمارة يعقوبيان
- ريم بسيوني: مدبولي
- خالد الخميسي: تاكسي
- علاء الأسواني: شيكاجو
- نجيب محفوظ: ميرامار
- نجيب محفوظ: الكرنك
- أحلام مستغانمي: ذاكرة الجسد
- رجاء عبدالله الصانع: بنات الرياض
- الطاهر وطار: الولي الطاهر يعود إلى مقامه الزكي
- الطاهر وطار: الولي الطاهر يرفع يديه بالدعاء
- الطاهر وطار: الحوات والقصر
- نجيب محفوظ: صدى النسيان
- نجيب محفوظ: أصداء السيرة الذاتية
- الطيب صالح: عرس الزين
- إدوار الخراط: ترابها زعفران
- لطيفة الزيات: الشيخوخة وقصص أخرى
- يحيى حقي: قصص ليحيى حقي
- إلياس خوري: مملكة الغرباء
- أحلام مستغانمي: عابر سرير
- غسان كنفاني: أم سعد
- أحلام مستغانمي: فوضى الحواس
- نجاة حالو: سر الحياة
- سعد الله ونوس: مغامرة رأس المملوك جابر
- تميم صائب: لا تفقأ عينيك يا أوديب
- غسان كنفاني: مسرحية الباب
- غادة السمان: ختم الذاكرة بالشمع الأحمر
- غسان كنفاني: عائد إلى حيفا
- نجيب محفوظ: أولاد حارتنا
- The following are the word totals for the various corpora in the site:
- Ahram99: 16475979
- Hayat97: 19473315
- Hayat96: 21564239
- Tajdid02: 2919782
- Watan02: 6454411
- Thawra: 16631975
- Novels: 928776
- Modern Literature: 1001899
- Premodern: 912996
- Nonfiction: 579545
- Medieval Science: 223249
- Quran: 84532
- 1001Nights: 557908
- Choosing the combinations will often make your 'wait' time MUCH longer.
- If you are searching for an uncommon word, using the combinations can be valuable, since this will give you more examples.
- If you are searching for a common word, using the combinations can be unhelpful; besides taking a long time to return, it will return so many results (10s of thousands) that you will not be able to deal with them efficiently.
- Part of Speech Filters
- String
- String will not apply a filter to the results.
- EVERYTHING that matches your string will be returned, no matter what is before or after it.
- If you type 'ktb' (كتب) you will get EVERY example of that string in the corpus, including the following:
- Alktb الكتب
- ktbtm كتبتم
- fyktb فيكتب
- AlmktbQ المكتبة
- ystktb يستكتب
- ktbrcAt كتبرعات
- Other Part of Speech Filters (Read this before reading the choices below)
- Any part of speech filter (POS) other than 'string' will filter your results to the:
- search string alone, with only space or punctuation before or after
- search string with suffixes and prefixes that go with the chosen POS
- all other instances of the string will be filtered out (those that are not the bare form, or with the known suffixes and prefixes)
- Noun
- Choose 'noun' and the program will accept:
- the bare search string you typed in
- the conjunctions wa- و and fa- فـ
- the attached prepositions ka- كـ and bi- بـ and li- لـ
- the definite article (ال)
- the pronoun endings
- the alif tanwiin at the end marking indefinite accusative
- the dual ending markers
- If you type in ktb كتب, it will accept (for example):
- ktb كتب
- Alktb الكتب
- wAlktb والكتب
- llktb للكتب
- bAlktb بالكتب
- ktbh كتبه
- ktbnA كتبنا
- ktbhn كتبهن
- bktbhA بكتبها
- The program would also accept ktbAn if it found it, since it assumes you are typing in a singular, and it looks for possible dual endings added to it.
- but it will not accept (it will filter out):
- ktbtm كتبتم
- fyktb فيكتب
- ystktb يستكتب
- ktbrcAt كتبرعات
- If you type a noun that ends with the feminine marker (Q ة), the program knows to allow forms that have a 't' instead when there are pronoun endings
- If you type ktAbQ كتابة it will accept:
- ktAbQ كتابة
- AlktAbQ الكتابة
- ktAbthA كتابتها
- bktAbtnA بكتابتنا
- The fact that you have filtered your results with the Noun filter does NOT mean that all your results will be nouns. It only means that, given the morphological ambiguity of Arabic, they COULD be nouns.
- For example, if you type in ktb كتب and choose 'Noun', some of the results will be unambiguously nouns:
- Alktb الكتب
- bktbhA بكتبها
- while others will simply be ambiguous:
- ktb كتب (could be 'books' or 'he wrote')
- ktbnA كتبنا (could be 'our books' or 'we wrote')
- ktbh كتبه (could be 'his books' or 'he wrote it')
- It WILL filter out forms that unambiguously are NOT nouns, which can be very helpful.
- Choosing a filter, therefore, reduces the number of false hits, but does not eliminate them entirely.
- REMEMBER THAT THIS TOOL CANNOT AND DOES NOT TRY TO OVERCOME THE AMBIGUITIES OF ARABIC MORPHOLOGY, SO IF A FORM IS INHERENTLY AMBIGUOUS, YOU WILL GET IT EVEN THOUGH IN CONTEXT IT IS NOT WHAT YOU ARE LOOKING FOR!
- NOTE: Remember, it only searches for what you type in. If you type in a singular form, it will only search for that. If you want to find the plural forms, you must search for them separately, or at the same time in a combined search.
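- The behaviour of the noun filter described above can be approximated with one regular expression per search string. This is a minimal sketch, not the program's actual implementation: the affix inventories in `PRONOUNS` and `PREFIXES` are assumptions chosen to cover the examples listed here, and `noun_pattern` and `passes_noun_filter` are hypothetical names.

```python
import re

# Assumed affix inventories (illustrative only; the site's real lists may differ).
PRONOUNS = "h|hA|hm|hmA|hn|k|km|kmA|kn|y|nA"
# Optional conjunction (w, f), optional attached preposition (k, b, l),
# optional definite article (Al, or l after the preposition l as in llktb).
PREFIXES = r"[wf]?[kbl]?(?:Al|l)?"


def noun_pattern(stem):
    """Build a pattern for the bare stem plus the noun prefixes and suffixes."""
    if stem.endswith("Q"):
        # Feminine marker: keep Q on the bare form, switch to t before endings.
        base = re.escape(stem[:-1])
        core = rf"{base}(?:Q|t(?:{PRONOUNS}|An|yn))"
    else:
        # Pronoun endings, alif tanwiin (A), or dual endings (An, yn).
        core = rf"{re.escape(stem)}(?:{PRONOUNS}|A|An|yn)?"
    return re.compile(rf"{PREFIXES}{core}")


def passes_noun_filter(stem, word):
    """True if the whole word is the stem with only the allowed affixes."""
    return noun_pattern(stem).fullmatch(word) is not None


words = ["ktb", "wAlktb", "llktb", "bktbhA", "ktbtm", "fyktb", "ktbrcAt"]
print([w for w in words if passes_noun_filter("ktb", w)])
```

- Note that the sketch, like the filter it imitates, applies the affixes mechanically: it lets ktbnA through whether it means 'our books' or 'we wrote'.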
- Adj
- Choose 'adj' and the program will accept:
- the bare search string you typed in
- the conjunctions wa- و and fa- فـ
- the definite article (ال)
- the alif tanwiin at the end marking indefinite accusative
- the masc. and fem. dual endings
- with or without a feminine marker (Q ة) ('noun' will NOT do this)
- but NOT the prepositions or the pronoun endings
- If you type in jmyl جميل, it will accept (for example):
- jmyl جميل
- Aljmyl الجميل
- jmylQ جميلة
- wAljmylQ والجميلة
- but it will not accept (it will filter out):
- jmylhA جميلها
- bAljmyl بالجميل
- If you want to try to find forms like jmylhA جميلها and bAljmyl بالجميل, you need to search for jmyl جميل as a noun, not as adj.
- Alternatively, you could search directly for jmylhA جميلها and bAljmyl بالجميل as strings, bypassing the POS filters.
- Remember that choosing noun or adj or any other part of speech does not imply a moral commitment on your part that this form actually IS what you are choosing. It only means that you want it to allow that particular set of prefixes and suffixes through the search filter.
- As with nouns, if you want to find the plurals as well you must explicitly include them. It does not happen automatically.
- Adv
- Choose 'adv' and the program will accept:
- the bare search string you typed in
- the conjunctions wa- و and fa- فـ
- nothing else
- This category is handy for adverbs, but is also useful when you are searching for a specific form, and not an entire conjugation.
- If you want to find the noun ktb كتب only when it has the definite article preceded by b ب (and not all the other possible forms of ktb كتب), type bAlktb بالكتب and choose 'adv'. It will accept:
- bAlktb بالكتب
- fbAlktb فبالكتب
- wbAlktb وبالكتب
- and nothing else. The same technique works if you want to find a specific verb form (nqwl نقول) as opposed to all the conjugations of that verb.
- Remember, choosing 'adv' does not mean you think the word is an adverb. It means that you want the filter to cut out everything except the specific form you typed in, plus that form with wa- or fa-.
- Choosing 'adv' is almost the opposite of choosing 'string'. String accepts every occurrence of the string in the corpus, no matter what else surrounds it, and adv accepts basically only what you typed as a complete, isolated word and nothing else.
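- The contrast between 'string' and 'adv' can be sketched in a few lines. This is an illustration under simplifying assumptions: the search is treated as a literal string rather than a regular expression, and `string_filter` and `adv_filter` are hypothetical names, not functions of the program.

```python
import re


def string_filter(search, word):
    """'string': accept the word if the search string occurs anywhere in it."""
    return search in word


def adv_filter(search, word):
    """'adv': accept only the exact form, optionally preceded by w (wa-) or f (fa-)."""
    return re.fullmatch(rf"[wf]?{re.escape(search)}", word) is not None


words = ["bAlktb", "fbAlktb", "wbAlktb", "bAlktbhA", "kbAlktb"]
print([w for w in words if string_filter("bAlktb", w)])  # all five words
print([w for w in words if adv_filter("bAlktb", w)])     # only the first three
```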
- Verb
- Choose 'verb' and the program will accept:
- the bare search string you typed in
- the conjunctions wa- و and fa- فـ
- the particle li- لـ
- the future prefix sa- سـ
- the colloquial prefixes Ha- حـ and bi- بـ
- the pronoun endings
- the perfect and imperfect verb conjugation suffixes and prefixes
- IMPORTANT NOTE: The program assumes you will type in the masculine singular (huwa) form of both the past and the present tense, separated by a comma. However, if you only type the first, it will usually guess the second correctly.
- If you type in ktb,yktb كتب،يكتب, it will accept:
- ktb كتب
- ktbt كتبت
- wktbwA وكتبوا
- ktbwhA كتبوها
- ktbth كتبته
- yktb يكتب
- fsyktbwn فسيكتبون
- lnktbhA لنكتبها
- It will also accept some fairly ludicrous forms like:
- flLktbny فلأكتبني
- nktbnA نكتبنا
- since the program more or less mechanically applies the rules without asking what the resulting forms mean.
- The suffixes and prefixes are applied to the exact string(s) you type in and nothing else.
- The program tries to handle double, hollow, defective, assimilated and hamzated verbs correctly, but its analyses are not perfect, so you should check the search strings it actually uses on the Summary Page.
- Advanced Search Capabilities
- Searching 'by hand'
- Click on 'advanced search' and then choose 'search by hand' from the three choices.
- In basic search the program does a fairly large amount of guessing and figuring to create the actual search strings it uses to search the corpus.
- In hand search you create the exact search strings the program uses. It doesn't try to second guess you. This gives you more control, but it also allows you to do searches that are relatively ridiculous.
- For example, instead of having a single choice for verbs, there are four parts of speech listed: verb, verb2, verb4, and verb3.
- 'verb' is for verbs in which the perfect and imperfect stems are identical, and for these you simply type in the past tense huwa form.
- 'verb2' is for verbs in which the perfect and imperfect stems are different, and for these you must type in both the huwa past and huwa present tense stems. (You can optionally include the 'y' on the imperfect stem. It will always be deleted.)
- Form IV verbs, Form VII, VIII and X verbs, Form I assimilated verbs, and Form II verbs with initial hamza are among the verbs with a different stem in the perfect and imperfect.
- 'verb4' is for verbs in which the perfect and imperfect both have two stems. In the case of hollow verbs, the stems are perfect-long, perfect-short, imperfect-long, imperfect-short. In the case of doubled verbs, the stems are perfect with only one of the doubled consonants, perfect with both, imperfect with only one, imperfect with both.
- 'verb4' is used exclusively for hollow and doubled verbs.
- Search words for 'verb2' should look like this: Lkrm,ykrm and wSl,ySl.
- Search words for 'verb4' should look like this: qAl,ql,qwl,ql and AHtl,AHtll,Htl,Htll, again with or without the 'y' on the imperfect forms.
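- The stem conventions for 'verb', 'verb2' and 'verb4' can be summarised as a small parser. This is a sketch of the input format only, assuming the stems are split on commas and that an optional leading 'y' on an imperfect stem is dropped as described above; `parse_stems` is a hypothetical name.

```python
def parse_stems(pos, text):
    """Split a hand-search verb entry into perfect and imperfect stems."""
    expected = {"verb": 1, "verb2": 2, "verb4": 4}[pos]
    stems = text.split(",")
    if len(stems) != expected:
        raise ValueError(f"{pos} expects {expected} stem(s), got {len(stems)}")
    # The first half of the list is perfect, the second half imperfect.
    half = max(1, expected // 2)
    perfect, imperfect = stems[:half], stems[half:]
    # Drop an optional leading 'y' typed on the imperfect stems. This naive
    # rule would misread an imperfect stem that itself begins with y.
    imperfect = [s[1:] if s.startswith("y") else s for s in imperfect]
    return {"perfect": perfect, "imperfect": imperfect}


print(parse_stems("verb2", "Lkrm,ykrm"))
print(parse_stems("verb4", "qAl,ql,qwl,ql"))
```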
- Searching with or without vowels
- The program automatically strips the vowels and kashidas from both the search string and the text before it searches, unless you specifically tell it not to by clicking on the 'include vowels' box. This option is only available in the 'by hand' search window.
- It is a bad idea to search for vowels
- The part of speech filters are designed to work without them, and so will lead to unpredictable results unless you choose 'string'.
- The use of vowels in texts other than the Quran is extremely inconsistent.
- The program only searches for EXACTLY what you type in and nothing else.
- If you type in: kutub كُتُب it will find:
- kutub كُتُب
- but it will not find:
- ktb كتب
- k_tb كـتب
- kt_b كتـب
- kutb كُتب
- ktub كتُب
- kut_ub كُتـُب
- Some of these problems can be overcome with the clever use of regular expressions, but for general purposes, a search with the 'include vowels' box checked will not typically give you what you are looking for.
- The 'include vowels' box is included mainly for people who are specifically investigating the use of vowels and kashidas in Arabic text.
- Note that the vowels in the Quran text are somewhat idiosyncratic, particularly in the order of vowels and shaddas, so care must be taken when searching with vowels in the Quran.
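- Stripping vowels and kashidas before searching can be pictured as follows. This is a minimal sketch, not the program's own code: it assumes Unicode text, where the Arabic short vowels, tanwiin, shadda and sukuun occupy U+064B to U+0652 and the kashida (tatweel) is U+0640. The transliterated variant assumes that a, u, i, ~ and _ are the only vowel and kashida symbols, which is inferred from the examples in these instructions.

```python
import re

# Arabic script: tatweel (U+0640) plus the diacritics U+064B..U+0652.
ARABIC_MARKS = re.compile("[\u0640\u064B-\u0652]")
# Transliteration: a, u, i (short vowels), ~ (shadda), _ (kashida); assumed set.
LATIN_MARKS = re.compile("[aui~_]")


def strip_vowels_arabic(text):
    """Remove short vowels and kashidas from Arabic-script text."""
    return ARABIC_MARKS.sub("", text)


def strip_vowels_latin(text):
    """Remove vowel and kashida symbols from transliterated text."""
    return LATIN_MARKS.sub("", text)


for form in ["kutub", "k_tb", "kt_b", "kutb", "ktub", "kut_ub"]:
    print(form, "->", strip_vowels_latin(form))
```

- Once both the search string and the text are stripped this way, all of the variant spellings listed above collapse to ktb, which is why the default (vowel-free) search finds them all.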
- Searching for words with hamzas
- Words that begin with a hamza present a special challenge for all search engines because texts are incredibly inconsistent.
- The engine searches only for what you type in.
- If you search for LTfAl أطفال, it will allow:
- LTfAl أطفال
- LTfAlh أطفاله
- AlLTfAl الأطفال
- but NOT:
- ATfAl اطفال
- ATfAlh اطفاله
- AlATfAl الاطفال
- If you type in ATfAl اطفال, the opposite will happen, and you won't get any of the examples with the hamza.
- The only way to guarantee that you get both is to type both in within square brackets: [AL] [اأ]
- Square brackets are the regular expression way of indicating that anything within them can go in that position.
- If you type: [AL]TfAl the program will accept:
- LTfAl أطفال
- LTfAlh أطفاله
- AlLTfAl الأطفال
- ATfAl اطفال
- ATfAlh اطفاله
- AlATfAl الاطفال
- Sometimes there is a hamza on the alif at the beginning of form VIIIs, so to get them all you must type [AEL]xtAr. etc.
- Because of these problems, the program automatically searches for all the initial hamza possibilities (A and L and E and M) when any one of them is typed in.
- In case you DON'T want to do this, i.e. if you are looking only for LTfAl and NOT ATfAl, then on the 'by hand' page you can click on 'differentiate initial hamzas'.
- In that case it will search only for the exact hamza sequence you type in and nothing else.
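- The automatic handling of initial hamzas amounts to replacing the first letter with a character class. The sketch below assumes Python's `re` syntax and uses hypothetical names (`collapse_initial_hamza`, `differentiate`); it illustrates the idea rather than the program's own code.

```python
import re

INITIAL_HAMZAS = "ALEM"  # the four initial hamza possibilities named above


def collapse_initial_hamza(search, differentiate=False):
    """Widen an initial A, L, E or M to the class [ALEM] unless told not to."""
    if not differentiate and search and search[0] in INITIAL_HAMZAS:
        return f"[{INITIAL_HAMZAS}]" + search[1:]
    return search


words = ["LTfAl", "LTfAlh", "AlLTfAl", "ATfAl", "ATfAlh", "AlATfAl"]
widened = collapse_initial_hamza("LTfAl")
exact = collapse_initial_hamza("LTfAl", differentiate=True)
print(widened, [w for w in words if re.search(widened, w)])
print(exact, [w for w in words if re.search(exact, w)])
```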
- Searching for alif maqsuuras, yaa's (and other variations)
- The Ahram99 corpus (which came from the internet) uses the letter yaa' for the alif maqsuura, although not consistently.
- If you want to find ALL examples of the word mqhe مقهى in the Ahram99 corpus, you must type: mqh[ey]
- Likewise, most of the newspapers sometimes leave the dots off words that are supposed to be yaa's. So you must search for them also with [ey] if you want to get everything.
- There are many other such variations you need to take into account to be able to find all of anything. For example the Ahram rarely leaves a space between the two parts of lA bd لا بد, while the Hayat does so fairly consistently.
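- Unlike initial hamzas, the final yaa' and alif maqsuura variation has to be typed in by hand. A small helper can build the [ey] class for you before you paste the result into the search box. This is a sketch with a hypothetical name (`yaa_variants`), assuming Python's `re` syntax matches what the search box accepts for square brackets.

```python
import re


def yaa_variants(search):
    """Replace a final e (alif maqsuura) or y (yaa') with the class [ey]."""
    if search.endswith(("e", "y")):
        return search[:-1] + "[ey]"
    return search


pattern = yaa_variants("mqhe")
print(pattern)
print([w for w in ["mqhe", "mqhy", "mqhA"] if re.fullmatch(pattern, w)])
```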
- Searching for more than one item at once
- If you search for more than one item at once, it will sort the results all together.
- This can be handy, for example, if you want all examples of a noun and its plural
- Type in the items separated by a comma. Be sure NOT to add extra spaces.
- To find 'book' and 'books', type: ktAb,ktb كتاب،كتب and choose noun.
- To find examples of two different ways to say 'during', type xlAl,[LA]VnAC خلال،[أا]ثناء and choose adv.
- If the words you type in go with different part of speech filters (say an adverb and a noun) results will be unpredictable.
- It is not possible to search for more than one word at once when one of the verb parts of speech is chosen, since the comma separated list is used in verbs to list the various stems of the single verb.
- Searching for phrases
- Typing a space will cause the program to look for a space.
- This means that if you type two or more words separated by a space, it will look for the entire phrase.
- Most of the part of speech filters make no sense when searching for phrases. However, sometimes they can be made to be helpful.
- Normally you would choose 'string' to simply find the phrase no matter what else is around it.
- If you want to find the exact phrase but want to allow wa- and fa-, then choose adv.
- If you want to allow the last word in the phrase to have pronoun endings, choose noun.
- If you type: lA gbAr clyh لا غبار عليه and choose adv it will allow:
- lA gbAr clyh لا غبار عليه
- flA gbAr clyh فلا غبار عليه
- wlA gbAr clyh ولا غبار عليه
- and nothing else.
- Searching with Regular Expressions
- Using regular expressions can greatly increase the power of your search.
- Some aspects of the regular expression language don't work yet, so be patient. Most parts do work however.
- You can use the backslash characters that mean specific things:
- \w any word character
- \s any space character
- \b a word boundary
- You can use quantifiers with these:
- \w? 0 or 1 word characters
- \w+ 1 or more word characters in a row
- \w* 0 or more word characters in a row
- A? either an A or not
- You can use square brackets to indicate a list of things, one of which can go there, and you can use quantifiers with the list:
- [hny] either a haa' or a nuun or a yaa'
- [hny]+ any combinations of haa', nuun and yaa' in a row
- [aui~]? either a fatha, damma, kasra, shadda or nothing
- You can use ^ at the beginning of this list to make it a list of things that cannot go there:
- [^AL] anything except A or L
- You can use ^ (outside of the brackets) and $ to indicate the beginning and end of a word, respectively:
- ^br (with the filter 'string' chosen) means find every word that begins with br
- At$ (with the filter 'string' chosen) means find every word that ends with At
- You need to think carefully about how your regular expression is going to interact with the part of speech filter you have chosen. For example:
- If you type \bktAb and choose the adv filter, the only thing it will allow is:
- ktAb
- since the adv filter already allows no suffixes, and only allows the wa- and fa- prefixes, which are cut out by your \b restriction.
- If you type \bktAb and choose the noun filter, however, you will get the bare stem and the word with pronoun endings.
- If you want complete control, choose string, and only what you cut out 'by hand' will be cut out.
- Indicate alternation between whole forms using parentheses and the vertical bar. You must match your parentheses or you will get an error.
- To have the program find mdrs, mdrswn, mdrsyn, or mdrsAt, type mdrs(wn|yn|At)? or alternatively mdrs([wy]n|At)?
- Using parentheses and the vertical bar in combination with square brackets and square brackets with the caret (^) can be a VERY powerful way to design precise searches.
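- If you want to try out an expression before running a search, you can test it offline against a short word list. The sketch below uses Python's `re` module, which is an assumption: the search engine's own regular expression dialect may differ in details, and the word list is invented for illustration.

```python
import re

words = ["mdrs", "mdrswn", "mdrsyn", "mdrsAt", "mdrsQ", "brnAmj", "ktAbAt"]

# Alternation between whole endings, with the group made optional by '?'.
plural_a = re.compile(r"mdrs(wn|yn|At)?")
# The same set of forms, written with a character class inside the group.
plural_b = re.compile(r"mdrs([wy]n|At)?")

print([w for w in words if plural_a.fullmatch(w)])
print([w for w in words if plural_b.fullmatch(w)])

# ^ and $ anchor the expression to the beginning and end of a word.
print([w for w in words if re.search(r"^br", w)])   # words beginning with br
print([w for w in words if re.search(r"At$", w)])   # words ending with At
```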
- Searching for punctuation
- This tool currently does not have a way to search for punctuation
- Cutting out results by hand
- Sometimes you know that the ambiguous morphology of a particular form is going to give you thousands of false hits that you don't want to look through to find what you are really looking for.
- For example, if you search for all forms of the verb bne ybny, you will get many examples of the imperative form Abn (almost all of which are really 'son') and of the third person feminine past tense form bnt (many of which are really 'girl')
- You may choose to do the search with those forms deleted so that you will have an easier time going through the remaining citations looking for something specific.
- To cut out forms that otherwise would be found by a search, type the search, a space, one or two dashes, a space, and then a regular expression that indicates what you don't want
- You can use a vertical bar to cut out multiple things.
- To cut out the exact form Abn and nothing else (choose verbd), type: bn,ai -- ^Abn$
- This will cut out Abn but not Abny.
- To cut out bnt no matter what is after it (choose verbd), type: bn,ai -- ^bnt
- This will cut out bnt and also bntk, etc.
- To cut out any form that has the sequence bnt anywhere in it (choose verbd) type: bn,ai -- bnt
- To cut out both Abn and bnt at the same time (choose verbd), type: bn,ai -- bnt|Abn
- To cut out bnt only when it is alone or preceded by wa- or fa-, type ^[wf]?bnt$
- Hopefully you get the idea. Like searching with regular expressions in general, this can be a very powerful tool in helping refine your searches.
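- The cut-out syntax can be read as "search, then discard every form that matches the second expression". The sketch below illustrates that reading with Python's `re` module; the function names and the list of forms are hypothetical, and the real program may apply the exclusion differently.

```python
import re


def split_exclusion(query):
    """Split 'search -- exclusion' (one or two dashes) into its two parts."""
    match = re.fullmatch(r"(.*?) --? (.+)", query)
    if match:
        return match.group(1), match.group(2)
    return query, None


def cut_out(forms, exclusion):
    """Drop every form in which the exclusion expression is found."""
    if exclusion is None:
        return list(forms)
    pattern = re.compile(exclusion)
    return [form for form in forms if not pattern.search(form)]


# Invented sample of forms that a search for the verb might return.
forms = ["Abn", "Abny", "bnt", "bntk", "wbnt", "ybny", "bnA"]
search, exclusion = split_exclusion("bn,ai -- ^Abn$")
print(search, exclusion, cut_out(forms, exclusion))
print(cut_out(forms, "^[wf]?bnt$"))
```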
- Using Arabic Script in regular expressions
- Regular expressions work in Arabic script as well as they do in English (\w = \و etc.)
- However, most browsers display the backslashes and parentheses in bizarre and unpredictable ways. If you type it in carefully, it should work, but if you try to read it after you have typed it in, or on the summary page, you will have trouble.
- Using the 'lookup' page
- Another choice on the Advanced Search page is Lookup. This is for looking up plurals of nouns and adjectives and for looking up verb forms of verbs with more than one form. It should not be used for verbs which only have one form.
- Simply type in a singular noun or adjective, or a past tense huwa verb form, and it will see if it has the other forms in its database. If it does, it will do the search based on what it finds. If not, it will tell you it can't find the form in its database. This does not mean the form is not in the corpus. It means that it is not in the list of forms the program happens to have for lookup purposes. If it can't find the form in its database, then you need to go back to basic search or hand search and search the corpus for the form.
- You cannot search for multiple items at once with Lookup. Anything over one item will be ignored.
- The lookup items in the database are based loosely on (but are quite different from) the dictionary files in Buckwalter's Morphological Analyzer.
- The Advanced Squared page
- The advanced squared page is not finished. When it is, it will provide a way for users to make detailed grammatical choices about what they are looking for and what they want to see.
Results
- Basic Information about Results
- The results come back after a few seconds wait.
- If you search for a single noun or adjective in a single corpus, it should come back in about 10 seconds
- If you search for multiple nouns, or verbs, or search in the combined corpora, the results will take longer
- If you search for a very common word that produces tens of thousands of hits or more, the program will sometimes balk because the database it is using will limit what can be inserted.
- The results of a search are saved temporarily in a database, and you can access them by clicking on the words that appear in the red bar.
- Summary Page
- This page comes up first automatically after a search, and you can return to it by clicking on 'summary'
- This page gives you summary information about your search:
- the search string you typed in, in both scripts
- the search strings actually used by the program after its 'figuring'
- the corpus you searched
- the time it took the search engine to do the basic search (this will generally be less than the actual time experienced by you, since it doesn't include the time it takes for the server to receive your request or to serve the results back to you).
- the part of speech filter you chose
- the part of speech filter that was actually used
- the number of 'hits'
- what that number translates to in terms of words per 100,000 words of your corpus
- The latter bit of information is useful for comparative purposes. The corpora are of vastly different sizes, so the actual number per corpus may be misleading, but the number per 100,000 words can be compared.
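- The per-100,000 figure is a simple normalisation: hits divided by corpus size, times 100,000. The sketch below uses word totals listed in these instructions; the hit counts are invented for illustration and `per_100k` is a hypothetical name.

```python
# Word totals taken from the corpus list in these instructions.
CORPUS_WORDS = {"Ahram99": 16475979, "Hayat97": 19473315, "Quran": 84532}


def per_100k(hits, corpus):
    """Normalise a raw hit count to hits per 100,000 words of the corpus."""
    return hits / CORPUS_WORDS[corpus] * 100_000


# Hypothetical hit counts: far fewer raw hits in the Quran, but a much
# higher rate once corpus size is taken into account.
print(round(per_100k(5000, "Ahram99"), 2))
print(round(per_100k(500, "Quran"), 2))
```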
- Citations Page
- Click on 'citations' in the red bar to see the 'hits' with the 10 words before and the 10 words after
- By default, these citations are sorted by the word that appears directly before the word you searched for
- Note that this sorting is done by the whole word, not by the root, so AlktAb, ktAb, and wAlktAb are nowhere near each other.
- The word directly before is repeated at the left hand side of the page so you can quickly glance through the citations and notice patterns, collocations, etc.
- If you would rather see the citations sorted by the word directly after the 'hit', click on the red sentence that says 'sort by word after'
- The citations are shown 100 at a time.
- If your search returned more than 100 citations, they are organized into 'pages' which you can access by clicking on the red numbers at the top.
- If you searched a single corpus, the subsection column will indicate what subsection of the corpus the example comes from.
- The newspapers have self-labeled subsections, so sometimes you need to think to figure out what the codes mean, but they are related to the various sections of the newspaper.
- If you have searched a combined corpus, the subsection column will tell you which specific corpus the example came from.
- If you want to see the exact reference of your example, click on the red 'subsection' heading and it will change to reference.
- The references in many of the corpora are basically incomprehensible, and not very helpful, but the references to some, like the Quran and the novel, are useful.
- The references in the Quran are to Sura and verse.
- The references in the novel are to chapter, section of chapter (divided by stars), and paragraph number.
- The references in 1001 Nights are to Nights and paragraphs (but in an odd kind of way I'll explain if you ask me by e-mail)
- If you want to see more context than the 10 words before and the 10 words after provides, click on the number at the beginning of the citation.
- This will bring up a separate window that will display the whole verse from the Quran, the whole paragraph from the novel and 1001 Nights, and the whole article from the newspapers.
- You may have to use your browser's Arabic script search function to find the place in the larger context where your item appears, since it does not highlight it.
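- The citation display is a keyword-in-context listing. The sketch below shows one way to build such a listing and sort it by the preceding or following word; it assumes the text is already split into words, and the names (`citations`, `sort_by_word_before`, `sort_by_word_after`) and sample sentence are hypothetical.

```python
def citations(tokens, is_hit, width=10):
    """Collect each hit with up to `width` words before and after it."""
    rows = []
    for i, token in enumerate(tokens):
        if is_hit(token):
            before = tokens[max(0, i - width):i]
            after = tokens[i + 1:i + 1 + width]
            rows.append((before, token, after))
    return rows


def sort_by_word_before(rows):
    """Sort by the whole word directly before the hit (not by its root)."""
    return sorted(rows, key=lambda row: row[0][-1] if row[0] else "")


def sort_by_word_after(rows):
    """Sort by the whole word directly after the hit."""
    return sorted(rows, key=lambda row: row[2][0] if row[2] else "")


tokens = "qrL ktAb jdyd fy ktAb Lxr".split()
rows = citations(tokens, lambda t: t == "ktAb", width=2)
for before, hit, after in sort_by_word_before(rows):
    print(" ".join(before), "[" + hit + "]", " ".join(after))
```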
- Subsections Page
- Click on 'subsections' to see the totals for the various subsections of the corpus you searched
- Again, if you searched a single corpus, those subsections will be those defined for that corpus.
- Some corpora, particularly the non-news ones, have no subsections defined.
- If you searched a combined corpus, the subsections will be the names of the corpora in that combined corpus
- The results in this section are ordered from most frequent to least frequent in terms of number of hits.
- The number per 100,000 words for each subsection is also given.
- This page allows you to compare, say, premodern Arabic with modern, or to compare newspapers from various places
- Type lA bd and choose newspapers, for example, click on subsections, and you will find that the Ahram uses it much less frequently than the Hayat.
- Type lAbd and choose newspapers, though, and you will find the opposite (the Ahram apparently typically types the phrase without a space, and the Hayat does the opposite).
- Word Forms Page
- Click on 'word forms' to see the exact forms that your search produced, ordered by frequency.
- Examining this list can give you important hints about normal usage.
- Examining this list can also help you see easily what kinds of false hits you are getting so you can work to eliminate them
- It is strongly suggested that you examine this list for every search before you take the rest of the results seriously.
- See if there are any forms you expected to find that aren't there
- See if there are any forms you did not expect to find that are there
- See if you can figure out what it is about the expression you typed in (and about Arabic morphology) that would have created these problems
- Click on any word in the word form list and a window will open up with the citations for just that word form.
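- The checks suggested above (forms you expected but did not get, and forms you got but did not expect) are easy to do on a list of word forms copied from the page. A minimal sketch, with a hypothetical function name and invented sample data:

```python
from collections import Counter


def word_form_report(hit_forms, expected):
    """Return forms by frequency, plus the missing and the unexpected forms."""
    counts = Counter(hit_forms)
    missing = sorted(set(expected) - set(counts))
    unexpected = sorted(set(counts) - set(expected))
    return counts.most_common(), missing, unexpected


# Invented sample: forms returned by a verb search, and the forms hoped for.
hit_forms = ["Abn", "bnt", "Abn", "ybny", "Abn", "bnt"]
expected = {"ybny", "bnA", "bnt"}
by_frequency, missing, unexpected = word_form_report(hit_forms, expected)
print(by_frequency)
print("missing:", missing, "unexpected:", unexpected)
```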
- Words Before/After Page
- Click on 'words before/after' to see a list of the common words directly before and directly after the 'hit' ordered by frequency.
- Examining these lists is a good way to scope out the main usages and collocations of the word you searched for
- Examining the lists can also help you identify structures that you had not intended to search for that you want to cut out
- If you want to see only the citations with that particular word before or after, click on a word in the list
- A separate window will open showing you just those citations
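- The words before/after lists are frequency counts of the immediate neighbours of each hit. The sketch below shows the idea on a pre-split list of words; `neighbour_counts` and the sample text are hypothetical.

```python
from collections import Counter


def neighbour_counts(tokens, is_hit):
    """Count the words directly before and directly after each hit."""
    before, after = Counter(), Counter()
    for i, token in enumerate(tokens):
        if not is_hit(token):
            continue
        if i > 0:
            before[tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            after[tokens[i + 1]] += 1
    return before, after


tokens = "fy ktAb jdyd w fy ktAb qdym".split()
before, after = neighbour_counts(tokens, lambda t: t == "ktAb")
print(before.most_common())
print(after.most_common())
```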
Miscellaneous
- Recommended Browser
- The Firefox browser is recommended, although you should be able to use it on other browsers as well.
- Note About Numbers
- Most of the data for the current corpora were created before Unicode was in widespread use. The numbers (the digits), particularly, were entered in a variety of formats (even in the same article) which means that some show up frontwards and some backwards. I found myself unable to fix this, so you will simply have to live with the fact that the numbers might actually be reversed in any particular case.
- Corpus Information
- The corpora in the list currently include one year of Al-Ahram (1999), two years of Al-Hayat in separate corpora (1996, 1997), and a half year each of At-Tajdid (Moroccan) and Al-Watan (Kuwait), the Quran, 1001 Nights, several medieval medical and philosophical texts, 8 Egyptian novels, one Egyptian Arabic play, and some EgyptChat data from the internet, as well as the Penn Treebank news data. Most of the corpora are only available in combined groups, but the individual newspapers can be searched separately. Choosing to search ALL or NEWSPAPERS (both large combinations of corpora) will often lead to MUCH longer search times. If you combine that with a search for something that is going to produce hundreds of thousands of hits, the machine may balk and not return anything. In general, the ALL category is mainly appropriate for researchers looking for broad quantitative or comparative data. The smaller categories are usually going to be more appropriate for pedagogical data, or data for students, since it will come in a quantity that you can understand and deal with.
- The following are the word totals for the various corpora in the site. Note that you cannot add these for the total number, since the grouped corpora group and regroup in various configurations. For example, the Modern Literature group includes everything in Novels, and adds the play, the Egyptian Colloquial group includes the play and the Chat, etc. (By the way, you should be aware that EgyptChat includes some colloquial and a lot of fusha and mixed fusha and colloquial, while the literature and even the newspapers do include some colloquial.) The total number of words of the whole corpus is: 68943447.
- Ahram99: 16475979
- Hayat97: 19473315
- Hayat96: 21564239
- Tajdid02: 2919782
- Watan02: 6454411
- Treebank: 598590
- Novels: 387036
- Quran: 84532
- 1001Nights: 557908
- Premodern: 912996
- Medieval Science: 223249
- Egyptian Colloquial: 157099
- Modern Literature: 403901
- Warning
- Among the things you can do to make use of this tool annoying:
- search for very common words or strings in multiple corpora (you have to wait basically forever). When learning to use the tool, try a small corpus (say Treebank) and search for slightly less common words.
- Common errors to avoid:
- typing the noun with an article (the tool automatically looks for nouns and adjectives with and without the article; if you type it with the article it won't find any examples without it)
- choosing the wrong part of speech filter (looking for an adjective as a noun, or looking for a verb4 as verb);
- (by hand page:) not typing the input correctly for verb2, verb4 and verbd (note that these take special input on which see the instructions above; typing qAl and choosing verb4 will NOT give you the results you want);
- choosing noun when looking for a phrase (you should usually choose adv or string).
- (by hand page:) typing a word with one or more vowels, without clicking on 'include vowels'. Best is not to type vowels in the search word. See 'searching with vowels' in the advanced search section.
Announcements
- 18 October 2007: A list of the novels currently in the Modern Literature corpus, as well as an updated word total list of the various corpora, has been added. Click on Searching arabiCorpus>Detailed Searching Instructions>Choose Corpus.
- 25 July 2007: More novels and some non-fiction have also been added. I am behind in listing the information about the new items but it will eventually come. I have also added about 16 million words of Al-Thawra, a Syrian newspaper. The subsections of this newspaper are not as helpful as the other ones in the corpus (since I rely on those already on the site). But it is good to have a pure Syrian source of data for comparative purposes.
- 15 June 2007: Several new novels have been added, and the organization of the corpus menu has been made easier to navigate. Check the Corpus Information link under Miscellaneous to get an updated list of the items included and their word counts.
- 30 June 2006: Several new features have been added, and the general search analysis has been upgraded:
- The various verb choices under part of speech have been combined into one.
- The program assumes you will type the masculine singular (huwa) form for both the past and present like this: ktb,yktb. If you type something different it will make guesses as to what you meant, but could guess wrong. In the case of hollow and doubled verbs, you can also choose to type all four stems as before.
- Crucially, for doubled verbs, you need to type the shadda at the end. This is an exception to the rule that you don't type vowels into the search string. The shadda is stripped before searching, but the program uses it to become aware that what you typed is a doubled verb.
- Vowels typed in are now automatically stripped out.
- You can still search for multiple nouns and adjectives (say a noun and three forms of its plural) at the same time, but the program no longer allows you to search for more than one verb at once.
- The program now automatically searches for duals as well as singulars on all nouns and adjectives (and verbs).
- The program now does a somewhat better job of matching prefixes and suffixes on verbs, so that impossible combinations are weeded out better than before.
- Choices like searching with vowels and not collapsing initial hamzas are no longer available on basic search. Click the advanced search link to find them.
- A new Advanced Search option has been added. Clicking on this takes you to a new page with three options (Note that the instructions available from that page give more details about these options):
- Lookup: On this page you can type in a noun, and it will look up its plural(s) and include them in the search. If you look up an adjective it will do the same, but also include the feminine singular (for example if you type in LHmr, it will automatically also look for HmraC). If you choose Hollow, Doubled, Assimilated or one of the other verb forms listed, it will look up all associated forms and create the search string automatically. Note that it does not allow you to look up 'regular' verbs like ktb that don't need to be looked up, since the perfect and imperfect stems are exactly the same. To do that, you need to go to basic search or 'by hand' search.
- Hand: This page is like the original basic search, but without trying to second guess you. It allows you to craft your search however you like, and makes you live with the consequences of inappropriate choices. This is the only place where you can now search with vowels or without collapsing initial hamzas.
- Advanced Squared: This page is under construction and currently doesn't do anything. The plan is to allow you to completely specify the grammatical components of the form you are searching for, and specify which forms you want to find more completely. When finished, this will allow users who are not familiar with regular expressions to tweak their searches more effectively.
- 15 May 2006: I have added a checkbox right under the submit button. Previously, you had to type in specifically whether you wanted to search for an alif at the beginning of a word alone, or with a hamza above, or with a hamza below. Now, no matter which of these (or even an alif with a madda) you type, it will search for all of them, unless you click the checkbox 'differentiate initial hamzas', in which case it will search only for what you type in. This was done because I discovered that most people really do want to see all examples of what they are looking for, but they forget to type in the choice (i.e. they want to see all examples of LwlAd 'boys' but they type either LwlAd or AwlAd, and either would only give them part of the examples). To see all the examples they would have needed to type [AL]wlAd. Now when you type AwlAd or LwlAd it automatically searches for [AL]wlAd unless you click the check box.
- 9 May 2006: The tutorial is done (in its initial version). If you have time, check it out and send me feedback.
- 9 May 2006: Two short books by Maimonides on the practice of medicine (Asthma and Aphorisms) have been added to the Premodern corpora.
- 5 May 2006: I have started to fix the tutorial to be a real, beginning level tutorial. If you are having trouble with the site, it would be a good place to start.
- 24 April 2006: You may now click on a word in the word forms list and see the citations for just that word form (just as clicking on a before/after word in the before after list gives you just the citations with that word before or after).
- 19 April 2006: I have added a tutorial section to these Instructions. I will give there help on specific searches that people have run, or that they have asked me about. It will be updated regularly, hopefully.
- 19 April 2006: The program has been fixed to allow the use of grouping parentheses and vertical bars to signal alternation in regular expressions. This means, however, that you can no longer use the vertical bar for other purposes. To indicate you want the program to search for two words, use a comma instead.
- 6 April 2006: A space bar now means a space bar, so you can no longer type in two words separated by a space bar and have the program search for either of those words. To do that you must use the comma. A space bar between two words now tells the program to search for that exact phrase in that order.
- 5 April 2006: The part of speech femnoun has been omitted. Its function now takes place automatically when you choose 'noun'.
Tutorial
- First Search: Look for a single noun
- Click on the red Instructions above the submit button to get these instructions in a separate window (otherwise they will go away as you follow them).
- Click on 'dt chart' to see a chart of the transliteration system in a separate window.
- Type mktb into the leftmost box. Do NOT type any vowels (i.e. do NOT type maktab). Do NOT type the definite article (do NOT type Almktb).
- Choose 'noun' from the POS list.
- Choose Ahram99 from the corpus list.
- Click on 'Submit'.
- Wait about 10 seconds.
- Examine summary of search that appears, and notice how many examples it found, and how frequent these are in the corpus (per 100,000 words).
- Click on 'citations' in the dark red bar.
- Scroll down and look at a few of the citations. Scroll back to the top.
- Note that there are about 40 pages of results. Click on page 25.
- Note that each example gives you the word in context with 10 words before and 10 after.
- Note that the examples are organized by the word that comes before (here fy), and that this word is also listed at the beginning of each line so it can be easily picked out.
- Click on 'sort by word after' to sort the examples by the word after instead.
- Click on page 25 again, and see what kind of information you get with this order.
- Click on one of the numbers at the left, and see a new window open with even more context. Close that window
- Click on 'subsections' in the dark red bar.
- Examine the frequencies and relative frequencies of this word in the various sections of the Ahram.
- Click on 'word forms' in the dark red bar.
- Notice the different forms in which this word was found: alone, with wa-, bi-, fa-, with the definite article, and with various pronoun endings. Notice which of these forms were more common and which relatively rare.
- Click on مكتبك in the middle of the second column to see the citations just for that word form in a separate window. Examine briefly and close the second window.
- Click on 'words before/after' in the dark red bar.
- Examine the most common words that come before our search word and the most common words that come after. Any surprises? Is this what you would have predicted?
- Click on التنسيق in the second column to see the 200 examples of this word coming after our search word in a new window. After examining for a few moments, close the new window.
- Click on 'summary' in the dark red bar to go back to the summary page.
- Try other single words
- Type jmyl into the leftmost box, choose the 'adj' POS, the Ahram99 corpus, and click submit.
- Go through the various choices (as under First Search) and see the differences. Note particularly under 'word forms' that you get examples with and without the feminine ending, but no examples with pronoun endings or prepositions.
- Now try running the same word on the same corpus but with 'noun' POS chosen. Note under 'word forms' that you don't get the feminine forms, but you do get forms with prepositions and pronouns.
- Type tqrybA into the leftmost box, choose 'adv' POS and the Ahram99 corpus and run through the various options once you get the results.
- Type ktAbhm into the leftmost box, choose 'adv' POS and the Ahram99 corpus, and look to see what you got under word forms. 'adv' is a good POS choice when you are looking for a very specific form, and don't want to see other possible prefixes and suffixes.
- Type prb into the leftmost box, choose the 'verb' POS, the Ahram99 corpus, and run through the various options looking at the results. Note particularly the large number of forms under 'word forms'.
- Now run prb again, but this time choose the 'string' POS. Note under word forms that besides all the verb forms you got a moment ago, you are now getting a bunch of other things that happen to contain the sequence prb, including a bunch of typos.
- Account for variability in the corpus
- Type AwlAd into the leftmost box, choose 'noun' POS, and the Ahram99 corpus.
- Look through the various pages of results. Note on the word form page that you got no hits with a hamza on the beginning alif.
- Now type LwlAd into the leftmost box, choose 'noun' POS, and the Ahram99 corpus.
- Look through these results. Note that on the word form page that you got no hits without a hamza on the beginning alif.
- These results are mutually exclusive. But what if you want to see them all together, or what if you wanted to know how many times this word was used in the corpus, whether or not the hamza was written on the beginning alif?
- There are several ways to have the corpus look for alternates (either this OR that). The easiest involves square brackets.
- Type [AL]wlAd into the leftmost box, choose 'noun' POS, and the Ahram99 corpus.
- The square brackets tell the program to search for any one of the characters in the brackets. In this case it searches for either A or L, meaning that it is searching for either AwlAd or LwlAd.
- Look through these results. You can see from the total numbers, that the examples of AwlAd and LwlAd have been combined. You can see the same thing on the word forms page.
- In general, if you want ALL the examples of any form that begins with a hamza on an alif, you should use square brackets to have the program search for both possibilities.
- Now type lbnAne into the leftmost box, choose 'noun' POS, and the Ahram99 corpus.
- Notice that you get 20 hits that end with an alif maqsura. If you had searched for lbnAny and wanted to see all examples of use of this word, the program would have missed these 20.
- In general, if you want ALL examples of forms that end in y, you would need to use [ey] instead.
- The Ahram itself is a special case because for whatever reason they use the yaa' for the alif maqsuura much of the time. So if you search for mqhe (noun) in the Ahram99 corpus you get zero results, but if you search for mqhy you get 158, and examining their citations reveals that they are meant to be mqhe.
- Therefore, if the Ahram is included in your search, you should use [ye] for words that end in y AND e.
- Now type msWwl into the leftmost box, choose 'adj' POS and the Newspaper combined corpus. You will have to wait longer for the results because of the larger corpus.
- If you check out the subsections page, you will see that there are a huge number of hits in the other papers, but very few in the Ahram.
- Now type msYwl into the leftmost box, choose 'adj' POS, and the Newspaper combined corpus.
- Notice in the subsections page that almost all the examples are from the Ahram, and very few from the others.
- If you wanted to catch ALL instances of this word in the Newspaper combined corpus, you would need to type ms[WY]wl.
- Now type tlfwn into the leftmost box, choose 'noun' POS, and the Ahram99 corpus. Look how few the results are.
- Now type tlyfwn into the leftmost box, choose 'noun' POS, and the Ahram99 corpus. Look how many the results are.
- If you want to find tlfwn and tlyfwn together you could type tly?fwn. The question mark tells the program the yaa' can either be there or not.
- The program is really just an automaton. It doesn't know what you want. It just looks for exactly what you type in. If you need it to pay attention to spelling variation, then you must tell it to do so.
- Moral: PAY ATTENTION to spelling variations in Arabic, and account for them in your searches, often by using square brackets or question marks.
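The bracket and question-mark tricks above can be tried offline. The following is a minimal sketch in Python, assuming the transliterated forms behave like ordinary strings under Python's `re` module; the word list is a made-up toy sample, not corpus output, and the helper name `hits` is hypothetical.

```python
import re

# Toy list of word forms in DT transliteration (invented for illustration).
forms = ["AwlAd", "LwlAd", "tlfwn", "tlyfwn", "msWwl", "msYwl", "lbnAny", "lbnAne"]


def hits(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]


print(hits(r"[AL]wlAd", forms))   # initial alif with or without hamza
print(hits(r"tly?fwn", forms))    # yaa' present or absent
print(hits(r"ms[WY]wl", forms))   # either hamza seat
print(hits(r"lbnAn[ey]", forms))  # final yaa' or alif maqsura
```

Each pattern picks up both spellings, which is the behaviour the tutorial steps above rely on.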
- Search for more than one word at once
- Type mktb,mkAtb into the leftmost box. Note that there is no space after the comma. Choose 'noun' POS and the Ahram99 corpus.
- Notice on the word forms page that the two words have been combined into a single search. You can string any number together with commas.
- Now try the following combinations:
- hjwm,hjmAt,hjwmAt,hjmQ,mhAjmQ (check out the word forms page to get an idea of relative frequency)
- bTbycQ AlHAl,TbcA,bAlTbc (ditto)
- hAtf,tlfwn,tlyfwn,mwbAyl,mHmwl,xlwy
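One way to picture a comma-separated search is as a single alternation. This is only a sketch of that idea in Python: the function name `comma_query_to_regex` and the sample sentence are hypothetical, and the real program additionally applies the prefixes and suffixes of the chosen part of speech, which is left out here.

```python
import re


def comma_query_to_regex(query):
    """Turn 'mktb,mkAtb' into '(mktb|mkAtb)'; spaces inside an item stay literal."""
    items = [item for item in query.split(",") if item]
    return "(" + "|".join(items) + ")"


pattern = comma_query_to_regex("bTbycQ AlHAl,TbcA,bAlTbc")
text = "TbcA hw bTbycQ AlHAl"  # invented sample text
found = [m.group(0) for m in re.finditer(pattern, text)]
print(pattern)
print(found)
```

Because a space is kept as a literal space, the two-word item is matched only as an exact phrase, while the comma separates independent alternatives.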
- Filter results 'by hand'
- Type lbnAn into the leftmost box, choose 'noun' POS, and Ahram99 corpus.
- Examine the word forms. Notice that you have quite a few examples of lbnAny, which the program got by putting the y suffix for 'my' on the end of the noun.
- However, if you click on this word on the word forms page and examine these citations, you will see that none of them mean 'my Lebanon'. The y here is a nisba adjective suffix.
- In other words, you have gotten a bunch of things you don't want because of the morphological ambiguity of Arabic.
- Let's say you want to look through all the citations with lbnAn, and you don't want to have to deal with all those lbnAny's.
- Type lbnAn -- y$ into the leftmost box, choose 'noun' POS and Ahram99 corpus.
- The two dashes tell the program that whatever you type after them is NOT wanted (filter it out).
- The dollar sign means that this is at the end of the word.
- So this means, search for all examples of lbnAn with normal noun suffixes, but if you find one that ends in a y, filter it out.
- A look at the word forms list will convince you that it has now gotten rid of the unwanted lbnAny's.
- Try the following searches with and without the dash section and check the word forms page to see what it cuts out:
- mktb -- ^m (this will cut out all word forms that begin with a miim, i.e. all that don't have some other prefix)
- mktb -- h (this one will cut out all word forms that have a haa' in them, meaning that mktb with certain pronoun endings will be cut out.)
- mktb -- h|y|n (this one will cut out all word forms with a haa', yaa', or nuun, meaning that mktb with all but second person pronoun endings will be filtered out. Note that the vertical bar represents alternation.)
- mktb -- h|y|n|km?A?$ (this one also cuts out the second person forms, but you have to be careful: if you just list kaaf by itself it will cut out everything, since kaaf is in the word itself. So you need the question marks and the dollar sign.)
- A good strategy is to perform plain searches first, and then examine the word form list to see if there are any forms you would like to filter out by hand. If there are, create a 'dash' phrase to do the job.
- Note for those with a 'regular expression' background: You can consider the list after the dashes to be a regular expression for which the program automatically provides an outside set of grouping parentheses.
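The 'dash' filter can be imitated on a list of word forms. The sketch below is an assumption about how such a filter might work, not the program's actual code: `split_query` and `keep` are hypothetical names, and the word forms are invented. It wraps the text after the dashes in one outer pair of parentheses, as the note above describes.

```python
import re


def split_query(query):
    """Split 'lbnAn -- y$' into (search part, compiled exclusion regex or None)."""
    search, sep, unwanted = query.partition("--")
    unwanted = unwanted.strip()
    if not sep or not unwanted:
        return search.strip(), None
    # One outer group around the whole exclusion list.
    return search.strip(), re.compile("(" + unwanted + ")")


def keep(word_forms, exclusion):
    """Drop every word form in which the exclusion regex matches somewhere."""
    if exclusion is None:
        return list(word_forms)
    return [w for w in word_forms if not exclusion.search(w)]


forms = ["mktb", "wmktb", "Almktb", "mktbh", "mktby", "mktbnA", "mktbk", "mktbkm", "mktbkmA"]
_, exclusion = split_query("mktb -- h|y|n|km?A?$")
print(keep(forms, exclusion))
```

The `$` anchors only the last alternative, so a kaaf is excluded only when it sits at the end of the word (optionally followed by m and A), and the kaaf inside mktb itself is left alone.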
- Search for a Form IV or VIII verb
- Type [LA]krm into the leftmost box, choose 'verb' POS and Ahram99.
- Remember that the square brackets mean either this or that (i.e. either the alif with or without the hamza).
- Check total and the word forms. Notice that they are all basically past tense. Doesn't this verb have present tense forms?
- Type [LA]krm,krm (just like that with no space after the comma), choose 'verb2' POS and Ahram99.
- Notice you got a lot more hits from the total, and under word forms, notice that you get all the imperfect forms with the perfects now.
- Now type Antxb,ntxb and then choose verb2 POS and Ahram99. Again check word forms to see that you got both tenses.
- Now type wSl,Sl and then choose verb2 POS and Ahram99. Check word forms.
- Choosing plain verb, the program tries all the verbal prefixes and suffixes on the single stem you type in.
- This works for verbs that use the same graphical stem in the perfect and imperfect, but not for the others.
- The POS 'verb2' requires you to input 2 different stems, and it will try all the perfect suffixes on the first one, and all the imperfect prefixes and suffixes on the second.
- When searching for a verb, stop a moment and think: If I add the imperfect prefixes on exactly what I just typed in, is it going to work?
- If not, you probably should use 'verb2' or one of the other choices so you can tell the program exactly what the imperfect stem is.
- On the other hand, if you choose 'verb2' but just type in one stem (like [LA]krm), it will still only find the perfects.
- The ONLY way to find all the forms of Form IV, VII, VIII, IX, X and Form I assimilated verbs is to choose 'verb2' and type in TWO stems, separated by a comma.
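To see why a single stem misses the imperfects of a Form IV verb, it can help to build the two kinds of search by hand. The following Python sketch uses a deliberately tiny set of affixes that is assumed for illustration only; the program's own affix tables are not listed in these instructions, and the function names are hypothetical.

```python
import re

# Small, incomplete affix sets (assumptions made for this sketch).
PERFECT_SUFFIXES = ["", "t", "A", "wA", "nA"]
IMPERFECT_PREFIXES = ["y", "t", "n", "L", "A"]
IMPERFECT_SUFFIXES = ["", "wn", "An", "yn"]


def group(items):
    """Join alternatives into one parenthesised group."""
    return "(" + "|".join(items) + ")"


def verb2_regex(perfect_stem, imperfect_stem):
    """Perfect suffixes go on the first stem, imperfect affixes on the second."""
    perfect = perfect_stem + group(PERFECT_SUFFIXES)
    imperfect = group(IMPERFECT_PREFIXES) + imperfect_stem + group(IMPERFECT_SUFFIXES)
    return re.compile(r"\b(" + perfect + "|" + imperfect + r")\b")


def verb_regex(stem):
    """Plain 'verb': the same stem is used for both sets of affixes."""
    return verb2_regex(stem, stem)


one_stem = verb_regex("[LA]krm")
two_stems = verb2_regex("[LA]krm", "krm")

for form in ["Akrm", "Lkrmt", "ykrm", "ykrmwn"]:
    print(form, bool(one_stem.search(form)), bool(two_stems.search(form)))
```

With one stem, the imperfect prefix is glued onto [LA]krm, so ykrm can never match; supplying krm as the second stem is what lets the imperfect forms through.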
- Search for a doubled or hollow verb
- Type zAr into the leftmost box, choose 'verb' POS, and Ahram99.
- Check out the word forms and notice that they are all either the 'he', 'she', and 'they' forms of the past tense, or present tense passives.
- (You will also notice ambiguous forms like wzArth, which COULD mean 'and she visited him' but which probably mean 'his ministry', but that is a different issue. See instructions on filtering results by hand if desired.)
- Notice that there are no past or present tense forms with the short stem, like zrt, and no present tense forms with the long uu stem like yzwr.
- Type zAr,zwr into the leftmost box, choose 'verb2' POS and Ahram99.
- Check out the word forms. This time you get all the long forms like zArwA and yzwr, but none of the short forms.
- Now type zAr,zr,zwr,zr into the leftmost box, choose 'verb4' POS and Ahram99.
- Notice in word forms that you now get the short forms as well, like zrt.
- When you choose 'verb4', the program checks for the perfect long form suffixes (he, she, they) on the first form you type in, for the perfect short form suffixes on the second form you type in (you, I, they-f), for the imperfect long form prefixes and suffixes on the third form you type in (all but they-f and you-pl-f), and for the imperfect short form prefixes and suffixes on the fourth form you type in (they-f and you-pl-f).
- If you only type in one form when you choose 'verb4' the program will only check for the perfect he, she, they suffixes and nothing else.
- Type mr,mrr,mr,mrr into the leftmost box, choose 'verb4' and Ahram99.
- Notice under word forms that you get both the cases where one 'r' is written, and cases where two 'r's are written.
- In general, whenever you want to find all the forms of a verb, you need to think carefully about the various stems that the endings go on.
- Almost all verbs (except for defective roots) fit into one of three categories:
- a single (graphic) stem that all perfect and imperfect prefixes and suffixes are added to (choose 'verb')
- two stems, one used for all perfect suffixes, and the other for all imperfect prefixes and suffixes (choose 'verb2')
- four stems, one for perfect suffixes that begin with a vowel, the second for perfect suffixes that begin with a consonant, the third for imperfect suffixes that begin with a vowel, and the fourth for imperfect suffixes that begin with a consonant (choose 'verb4')
- The following categories of verbs fit these types:
- single stem ('verb'): Sound Form I, II, III, V, VI verbs
- two stems ('verb2'): Assimilated Form I, hamzated form II, Sound Form IV, VII, VIII, IX, X verbs
- four stems ('verb4'): Hollow verbs, Doubled verbs of all forms
- The program does not have any kind of dictionary lookup and it doesn't know what form verb you have typed in.
- All it does is look for the exact sequence of characters you type in, and check it with the prefixes and suffixes it knows for all verbs.
- Summary: the program will not do a good job of finding all forms of particular verbs unless YOU take responsibility to figure out the morphological category of the verb you are searching for, and search for it in a way appropriate to that category
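The four-stem idea can be made concrete by generating candidate forms from the four stems. This Python sketch follows the division of labour described above for 'verb4', but the affix lists are short, incomplete, and assumed for illustration; `verb4_forms` is a hypothetical helper, not part of the site.

```python
# Incomplete affix lists, chosen only to illustrate the four-way split.
PERFECT_LONG = ["", "t", "A", "tA", "wA"]             # he, she, they
PERFECT_SHORT = ["t", "nA", "tm", "tn", "tmA", "n"]   # I/you, we, you-pl, they-f
IMPERFECT_PREFIXES = ["y", "t", "n", "L"]
IMPERFECT_LONG = ["", "wn", "An", "yn"]
IMPERFECT_SHORT = ["n"]                               # they-f, you-pl-f


def verb4_forms(query):
    """Expand 'zAr,zr,zwr,zr' into a set of candidate word forms."""
    s1, s2, s3, s4 = query.split(",")
    forms = {s1 + suf for suf in PERFECT_LONG}
    forms |= {s2 + suf for suf in PERFECT_SHORT}
    forms |= {p + s3 + suf for p in IMPERFECT_PREFIXES for suf in IMPERFECT_LONG}
    forms |= {p + s4 + suf for p in ("y", "t") for suf in IMPERFECT_SHORT}
    return forms


hollow = verb4_forms("zAr,zr,zwr,zr")
doubled = verb4_forms("mr,mrr,mr,mrr")
print(sorted(hollow))
print(sorted(doubled))
```

This naive expansion overgenerates (it pairs every imperfect prefix with every suffix), which is harmless for a search because impossible forms simply find nothing.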
- Search for a defective verb
- Type in dc,au into the leftmost box, choose 'verbd' POS and Hayat97.
- Check out the word forms. Note that you are getting most of the various stems of the verb dcA ydcw.
- Try the other types of defectives: bn,ai and lq,ia and tlq,aa
- The POS 'verbd' expects you to type a stem (without the long vowel at the end), a comma, and then the vowel pattern (perfect then imperfect).
- The following vowel patterns are recognized:
- au (for verbs like dcA ydcw)
- ai (for verbs like bne ybny, all Form VIII and X passives, and it can also be used for passive defective verbs (which are actually ui))
- ia (for verbs like lqy ylqe)
- aa (for defective Form V and VI verbs like tlqe ytlqe)
- There are several reasons why the results of searches like these are less useful and accurate than the other kinds of verb searches, but it is something.
- You can always use 'string' and search for particular forms, or use regular expressions to define exactly what you want to get.
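The four vowel patterns can be read as a small lookup table of final letters. The sketch below, in Python, takes the endings directly from the example verbs listed above and rebuilds their citation forms; the names `ENDINGS` and `citation_forms` are hypothetical, and the full range of forms that 'verbd' searches for is not modelled.

```python
# Final letters per vowel pattern, read off the example verbs above:
# pattern -> (perfect ending, imperfect ending)
ENDINGS = {
    "au": ("A", "w"),  # dcA ydcw
    "ai": ("e", "y"),  # bne ybny
    "ia": ("y", "e"),  # lqy ylqe
    "aa": ("e", "e"),  # tlqe ytlqe
}


def citation_forms(query):
    """Turn 'dc,au' into the (perfect, imperfect) 'he' forms of the verb."""
    stem, pattern = query.split(",")
    if pattern not in ENDINGS:
        raise ValueError("unrecognized vowel pattern: " + pattern)
    perfect_end, imperfect_end = ENDINGS[pattern]
    return stem + perfect_end, "y" + stem + imperfect_end


for q in ["dc,au", "bn,ai", "lq,ia", "tlq,aa"]:
    print(q, citation_forms(q))
```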
- Search for an abstract form
- Note: Before trying these examples, remember that when a search produces more than 50,000 hits, it will sometimes balk and not return anything. When you are searching for patterns that are common, first choose a very small corpus to try it out on.
- Type m\wA\w\wQ into the leftmost box, choose noun POS, and the corpus Smell of the Sea (a short novel).
- This is the pattern for a form III Verbal noun, and so should find every form III VN in the text.
- \w stands for any Arabic character, so m\wA\w\wQ means something like mfAclQ (where fcl stand for any root).
- Check out the word forms and see the most common Form III verbal nouns in the novel.
- Now try the same search on the Ahram99 corpus.
- If you were able to get it to go through without balking, and if you had enough patience, you would discover under word forms:
- that mbArAQ and mpArkQ are the most common Form III verbal nouns in the Ahram
- that there are 3208 different Form III verbal nouns word forms used in the Ahram
- that there are some words that fit our pattern but that are not Form III verbal nouns
- Now try finding some other abstract forms (go back to using Smell of the Sea or another short corpus):
- Type t\w\wy\w into the leftmost box, choose 'noun' POS, and Smell of the Sea (Form II verbal nouns).
- Type s[ytnLA]\w\w\w\w? into the leftmost box, choose 'adv' POS, and Smell of the Sea (most future singular verbs with sa- but without pronoun endings--note there are many false hits).
- Type \w\w\w+Q into the leftmost box, choose 'adv' POS, and Smell of the Sea (all feminine nouns at least 4 letters long).
- Type y\w\w[aui]\w into the leftmost box, choose 'adv' POS, and Hayat97 and click the include vowels button (all four letter words beginning with y-mostly verbs-that have a vowel marked on the next to last letter and no other vowels, or three letter words beginning with y with one other vowel marked).
- Type \w[aui]\w[aui]\w into the leftmost box, choose 'adv' POS, and Hayat97 and click the include vowels button (all three letter words with vowels marked on the first and second letter).
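The abstract patterns above can be tested against a handful of words before running them on a corpus. This is a minimal Python sketch on transliterated strings; the word list is invented, and `matching` is a hypothetical helper. Note that in Python `\w` also matches digits and the underscore, which may differ from how the site treats it.

```python
import re

# Toy word list in DT transliteration (invented for illustration).
words = ["mbArAQ", "mpArkQ", "mktb", "tdrys", "mdrsQ", "ktAb"]


def matching(pattern, items):
    """Return the items that the pattern matches in full."""
    return [w for w in items if re.fullmatch(pattern, w)]


print(matching(r"m\wA\w\wQ", words))  # Form III verbal noun shape
print(matching(r"t\w\wy\w", words))   # Form II verbal noun shape
print(matching(r"\w\w\w+Q", words))   # ends in taa' marbuuta, at least 4 letters
```

A pattern only checks letter positions, so words that merely share the shape will also match; that is why the word forms page still needs to be inspected for false hits.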
- Search for a phrase
- Type lA gbAr into the leftmost box, choose 'adv' POS, and Hayat97.
- Check out some of the citations.
- Try these other phrases:
- Type (Al)?bnyQ (Al)?tHtyQ into the leftmost box, choose 'string' POS and Hayat97.
- Type pyk bdwn rSyd into the leftmost box, choose 'string' POS and Hayat97 (one hit only, check out the citation)
- Type ttHlb lh Al[LA]fwAh into the leftmost box, choose 'string' POS and All (one hit only, check out the citation)
- Type \bwzArQ Al\w+ into the leftmost box, choose 'string' POS and Treebank (to see a list of different ministries)
- Try AlwzArQ Al\w+ (obviously not as useful)
- Type mA lbV [LA]n into the leftmost box, choose 'string' POS and Treebank.
- Try (mA lbV|lm ylbV) [LA]n with 'string' POS and Hayat97 (be prepared to wait)
- You are basically limited only by your imagination. However, if you want to be sure to get the results you want, remember to account for hamza variations and other spelling variations.
- Also remember that some writers do not put spaces where others do (some write lAbd and others lA bd). You can capture this with lA ?bd (try it with Treebank)
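The phrase patterns from this section behave as follows on a short sample. This Python sketch uses an invented line of transliterated text (not a corpus citation) and a hypothetical helper `find_all`; it only illustrates how the optional space, the optional article, and `\w+` act inside a phrase.

```python
import re

# Invented sample text in DT transliteration.
text = "lAbd mn zyArQ wzArQ AlxArjyQ w lA bd mn tTwyr AlbnyQ AltHtyQ fy wzArQ AldfAc"


def find_all(pattern, s):
    """Return the full text of every match, in order."""
    return [m.group(0) for m in re.finditer(pattern, s)]


print(find_all(r"lA ?bd", text))                # with or without the space
print(find_all(r"(Al)?bnyQ (Al)?tHtyQ", text))  # article optional on both words
print(find_all(r"\bwzArQ Al\w+", text))         # wzArQ plus any definite word
```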
- Compare results of different corpora
- Type hAtf into the leftmost box, choose 'noun' POS and Newspapers.
- Go to subsections and see what papers use it more commonly than others.
- Now type tly?fwn into the leftmost box, choose 'noun' POS and Newspapers.
- Compare the subsection results with the ones you got for hAtf.
- Type lqd into the leftmost box, choose 'adv' POS and All.
- Note in subsections that the Quran and the novel have a much higher rate of use than the other corpora, and that the Ahram uses it more frequently than the Hayat.
- Type shl into the leftmost box, choose 'adj' POS and Hayat97.
- Look in subsections and see which parts of the paper like this adjective more than others.
- Then try it with Ahram99 and see if the same pattern holds. Look at the 'words before/after' page to see if you get any hints as to why this may be so.
- Try the same thing with hjwm (noun) and see if the subsection patterns make sense to you.
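Because the corpora differ greatly in size, raw hit counts are not comparable; a rate per 100,000 words is. The Python sketch below assumes the rate is simply hits divided by corpus size times 100,000. The corpus sizes come from the Corpus Information list above, while the hit counts are invented for illustration.

```python
# Word totals taken from the Corpus Information list.
CORPUS_WORDS = {"Ahram99": 16475979, "Hayat97": 19473315}


def per_100k(hits, corpus):
    """Relative frequency: occurrences per 100,000 words of the named corpus."""
    return hits / CORPUS_WORDS[corpus] * 100_000


# Hypothetical hit counts, not real search results.
sample_hits = {"Ahram99": 3295, "Hayat97": 1947}

for corpus, hits in sample_hits.items():
    print(corpus, round(per_100k(hits, corpus), 2))
```

Comparing the two rates rather than the two raw counts is what makes a statement like 'the Ahram uses it more frequently than the Hayat' meaningful.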
Questions/Problems (Click question to see answer)
- Why do I want to filter my results?
- If you are trying to find examples of a particular word or construction, and the program finds hundreds of results of something else that happens to match what you are looking for (because of the morphological ambiguity of Arabic), it can be very time consuming and annoying to search through the citations one by one looking for the ones you really want. If you can figure out a way to filter out the 'bad' ones, you can save yourself many hours.
- You want to take care of filtering yourself, and bypass the POS filters.
- Choose 'string' and type a regular expression that matches what the POS filters do, or which varies them.
- \bgbAr\b will find ONLY the word gbAr with no prefixes or suffixes.
- \b[wf]?gbAr\b allows gbAr, wgbAr and fgbAr and nothing else.
- \b[wf]?(Al)?gbAr\b allows gbAr, wgbAr and fgbAr with and without the definite article.
- \b[wf]?(Al)?gbAr(h|hA|k|y)?\b allows all of the above and the singular pronoun endings.
- \b[ytnLA]ktb[wyA]?[nA]?\b is one way to find present tense verb forms, here without any other prefixes or suffixes.
- The program can deal with quite complex expressions efficiently, as long as parentheses are matched, so you can get quite a bit of control over what you are searching for if you want it.
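The 'string' expressions above can be checked against a few forms before being run on a corpus. This Python sketch uses invented sample strings and a hypothetical helper `whole_hits`; it assumes `\b` treats transliterated letters as word characters, which holds in Python but is an assumption about the site.

```python
import re

NOUN = re.compile(r"\b[wf]?(Al)?gbAr(h|hA|k|y)?\b")
PRESENT = re.compile(r"\b[ytnLA]ktb[wyA]?[nA]?\b")


def whole_hits(regex, text):
    """Return the full text of every match, in order."""
    return [m.group(0) for m in regex.finditer(text)]


# Invented test strings: some forms should pass, others should be rejected.
noun_text = "gbAr wgbAr fAlgbAr gbArhA bgbAr gbArhm"
verb_text = "yktb tktbyn yktbwn ktb wyktb"

print(whole_hits(NOUN, noun_text))
print(whole_hits(PRESENT, verb_text))
```

Forms with a prefix or suffix that the expression does not list (bgbAr, gbArhm, wyktb) are rejected, which is exactly the extra control that bypassing the POS filters buys.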
- You want to find all examples of the verb twj, but looking at the results you realize that you have many instances of twjh, which could conceivably mean 'he crowned him' but which 99% of the time is in fact another verb: tawajjaha 'to head for'.
- Choose 'verb' and search for twj -- h. This will delete all the twjh's. You may miss a couple of 'he crowned him's' but it's worth it because you can now see that your results are mainly what you want and expected.
- You want to find all examples of the verb qAl, including passives.
- Searching for qAl,ql,qwl,ql with 'verb4' will give you all the actives. Searching for qyl,ql,qAl,ql will give you all the passives. Searching for q[Ay]l,ql,q[wA]l,ql will give you both, all sorted together. Try this on a smaller corpus like the Treebank first, since it generates a huge number of hits.
- You want to find all examples of the verb Lyd ('to support').
- Searching for Lyd using 'verb' will miss all the imperfects and all the perfects where the hamza was not typed on the alif. Searching for [LA]yd will get the hamza/alif variation, but will still miss the imperfects. To get those you need to choose 'verb2' instead and type [LA]yd,Wyd (i.e. the perfect stem and the imperfect stem with a comma in between). This should give you what you are looking for.
- You want to get comparative statistics on the use of the verb forms yumkinu and tumkinu when they have a feminine verbal noun subject, but when you search for tmkn (using 'adv') you get overwhelmed with things that are really tamakkana, tamakkun, tumakkinu and the like.
- This one is a good example of the inherent ambiguity of much of Arabic morphology, particularly graphemically. To get a statistic you can rely on, particularly if you don't have the stomach to actually read through hundreds of citations, you probably need to do what I call a 'limited search'. Instead of looking for all instances of tmkn, do a search for [yt]mkn \w+(Q|t(h|hA|k|hm|hmA|nA)), which finds every word directly after one of these verbs that ends in a taa' marbuuta or in a taa' followed by a pronoun ending, and then look at the list of words before and after. Look down the list of words after and pick out (say) the top 25 (or 10 or 5) feminine verbal nouns that come directly after ymkn or tmkn, and make sure they 'feel' right (i.e. you are pretty sure that the nouns you are choosing really are likely to function as the subjects of these verbs). Then do a search for those forms only, once with ymkn before, and once with tmkn. You don't have an overall statistic, but you have a 'subset' statistic that you can trust. Sample search strings with 5 verbal nouns I found would be:
- ymkn (AstfAd|tsmy|syTr|zyAd|mwAjh)(Q|t(h|hA|k|y|hm|hn|km|kn|nA|hmA|kmA))
- tmkn (AstfAd|tsmy|syTr|zyAd|mwAjh)(Q|t(h|hA|k|y|hm|hn|km|kn|nA|hmA|kmA))
- Comparing the totals you get on those two searches should give you a good start at figuring out the relative frequency.
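If you end up with a longer list of verbal nouns, the two search strings can be generated rather than typed by hand. The sketch below is illustrative only: limited_search and count_hits are hypothetical helper names, the sample sentence is invented, and Python's re module is assumed to behave like the site's engine for these strings.

```python
import re

# Pronoun endings as used in the sample search strings above.
PRONOUN_ENDINGS = ["h", "hA", "k", "y", "hm", "hn", "km", "kn", "nA", "hmA", "kmA"]

def limited_search(verb, noun_stems, endings=PRONOUN_ENDINGS):
    """Build 'verb (stem1|stem2|...)(Q|t(ending1|ending2|...))'."""
    stems = "|".join(noun_stems)
    suffix = "Q|t(" + "|".join(endings) + ")"
    return f"{verb} ({stems})({suffix})"

def count_hits(query, text):
    """Count non-overlapping matches of the query in the text."""
    return sum(1 for _ in re.finditer(query, text))

NOUNS = ["AstfAd", "tsmy", "syTr", "zyAd", "mwAjh"]
ymkn_query = limited_search("ymkn", NOUNS)
tmkn_query = limited_search("tmkn", NOUNS)

# Invented transliterated text, only to show the comparison step.
sample = "ymkn AstfAdQ kbyrQ w tmkn zyAdthA w ymkn mwAjhQ AlwDE"
print(ymkn_query)
print(count_hits(ymkn_query, sample), count_hits(tmkn_query, sample))
```

The generated strings can be pasted into the search box with 'string' selected; the local count is only a sanity check on the pattern, not a substitute for the corpus totals.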
site maintained by d. parkinson. contact him with any problems or suggestions.